Converting GRIB and NetCDF files to my database (CSV)

I have downloaded "High Resolution Initial Conditions" climate forecast data for one day. It came as a .tar.gz archive, so I extracted it into my local directory and got the files shown in the attached image. I think the files without an extension are GRIB data (the first word in them is "GRIB"). I want to get the data from these big files (GRIB and NetCDF formats containing climate data such as temperature and pressure on a grid) into my database, but they are binary. Can you recommend an easy way to extract data from these files? I can't find any information about handling these datasets on their website.
Converting these files to .csv would be nice, but I can't find a program that converts GRIB files.

Using Python and a few readily available modules, this is simple...
The Enthought Python Distribution includes several useful packages, including netCDF4 for dealing with NetCDF files!
I've never worked with GRIB files, but a quick Google search shows that there is a Python package for those as well, pygrib.
Or you can use PyNio, a Python package that can read and write netCDF3 and netCDF4 classic files, and read GRIB1 and GRIB2 files.
I don't know how much data you have, but converting all of it to *.csv is usually a crazy idea! Python is easy to learn and well suited to working with this kind of data (with the matplotlib package you can even plot it). And if you really need a *.csv, you can use Python to select a smaller domain, or just the variables you need, for example...
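For the NetCDF side, a minimal sketch with the netCDF4 package might look like this (the file name, variable name and dimension order are placeholders; inspect your own file first, e.g. by printing dataset.variables):
import csv
import netCDF4
ds = netCDF4.Dataset("forecast.nc")        # placeholder file name
temp = ds.variables["temperature"]         # placeholder variable name
lats = ds.variables["latitude"][:]
lons = ds.variables["longitude"][:]
# Dump one time step to CSV, assuming the variable is ordered (time, lat, lon).
with open("temperature.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["lat", "lon", "value"])
    for i, lat in enumerate(lats):
        for j, lon in enumerate(lons):
            writer.writerow([lat, lon, float(temp[0, i, j])])
ds.close()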

For conversion into text, look into http://www.cpc.ncep.noaa.gov/products/wesley/wgrib.html or http://www.cpc.ncep.noaa.gov/products/wesley/wgrib2/
Both are C programs from one of the big names in GRIB.
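If wgrib2 is installed, you can also drive it from a small Python script; as far as I know it has a -csv output option, but check wgrib2 -help on your build before relying on this sketch (the file names are placeholders):
import subprocess
# Assumes wgrib2 is on the PATH and that this build supports the -csv option.
subprocess.run(["wgrib2", "input.grib2", "-csv", "output.csv"], check=True)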

I'm currently dealing with a similar issue.
In my case I'm trying to rely on the GrADS software, which can "easily" transform GRIB data into other formats.
If your dataset is not huge, then you can export it to csv using this tutorial.
My dataset is 80 GB of GRIB binary files, so I'm very restricted in which software I can use to handle it (no R unless I find a computer with more than 80 GB of RAM).

Related

How can I analyze a data set directly from a website without downloading it first?

I am learning data analysis for my research. There is a website that contains the data sets for every day of the past 26 years. I have to write Python code such that if I enter a date, the data set for that day opens. Since the files are in .cdf format, I have to use Python to open them. Can someone tell me what I need to learn and which libraries will help me open the data sets from the website without downloading them first? I have some experience with Python, but not a lot.
Also, is there a good resource I can use to learn more about data analysis with Python?
You can use Pandas for this.
Pandas lets you load a file into a DataFrame directly from a link, without downloading it to your local machine.
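A minimal sketch of that idea, assuming the site serves plain tabular/CSV files; the URL pattern below is made up, and for a real .cdf/NetCDF file you would need a NetCDF-aware reader instead, but the remote-access idea is the same:
import pandas as pd
date = "1998-07-21"                             # e.g. taken from user input
url = "https://example.com/data/%s.csv" % date  # hypothetical URL pattern
df = pd.read_csv(url)                           # pandas fetches the remote file for you
print(df.head())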

Can I export all of the JSON documents in a collection to a CSV in MarkLogic?

I have millions of documents in different collections in my database. I need to export them to a csv onto my local storage when I specify the collection name.
I tried mlcp export but it didn't work. We cannot use CoRB for this because of some issues.
I want the CSV to be in such a format that if I later run an mlcp import, I can restore all the documents just the way they were.
My first thought would be to use the MLCP archive feature and not export to a CSV at all.
If you really want CSV, Corb2 would be my first thought. It provides CSV export functionality out of the box. It might be worth digging into why that didn't work for you.
DMSDK might work too, but it involves writing code to produce the CSV yourself, which sounds cumbersome to me.
Last option that comes to mind would be Apache NiFi for which there are various MarkLogic Processors. It allows orchestration of data flow very generically. It could be rather overkill for your purpose though.
HTH!
ml-gradle has support for exporting documents and referencing a transform, which can convert each document to CSV - https://github.com/marklogic-community/ml-gradle/wiki/Exporting-data#exporting-data-to-csv .
Unless all of your documents are flat, you likely need some custom code to determine how to map a hierarchical document into a flat row. So a REST transform is a reasonable solution there.
You can also use a TDE template to project your documents into rows, and the /v1/rows endpoint can return results as CSV. That of course requires creating and loading a TDE template, and then waiting for the matching documents to be re-indexed.
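As a very rough sketch of that last option, and assuming a TDE template is already in place: something along these lines should work from Python, but the host, credentials, schema/view names and the exact Content-Type for the Optic query are all assumptions, so check the REST API docs for your MarkLogic version.
import requests
from requests.auth import HTTPDigestAuth
query = "op.fromView('mySchema', 'myView')"   # hypothetical schema/view names
resp = requests.post(
    "http://localhost:8000/v1/rows",
    data=query,
    headers={
        "Content-Type": "application/vnd.marklogic.querydsl+javascript",
        "Accept": "text/csv",                 # ask /v1/rows for CSV output
    },
    auth=HTTPDigestAuth("admin", "admin"),
)
resp.raise_for_status()
with open("export.csv", "w", encoding="utf-8") as f:
    f.write(resp.text)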

Efficiently Aggregate Many CSVs in Spark

Pardon my simple question but I'm relatively new to Spark/Hadoop.
I'm trying to load a bunch of small CSV files into Apache Spark. They're currently stored in S3, but I can download them locally if that simplifies things. My goal is to do this as efficiently as possible. It seems like it would be a shame to have some single-threaded master downloading and parsing a bunch of CSV files while my dozens of Spark workers sit idly. I'm hoping there's an idiomatic way to distribute this work.
The CSV files are arranged in a directory structure that looks like:
2014/01-01/fileabcd.csv
2014/01-01/filedefg.csv
...
I have two years of data, with directories for each day and a few hundred CSVs inside each. All of those CSVs should have an identical schema, but of course it's possible that one CSV is awry, and I'd hate for the whole job to crash because of a couple of problematic files. Those files can be skipped, as long as I'm notified somewhere in a log that it happened.
It seems that every Spark project I have in mind is in this same form and I don't know how to solve it. (e.g. trying to read in a bunch of tab-delimited weather data, or reading in a bunch of log files to look at those.)
What I've Tried
I've tried both SparkR and the Scala libraries. I don't really care which language I need to use; I'm more interested in the correct idioms/tools to use.
Pure Scala
My original thought was to enumerate and parallelize the list of all year/mm-dd combinations so that my Spark workers could each process a day independently (download and parse all of that day's CSV files, then stack them on top of each other with unionAll() to reduce them). Unfortunately, downloading and parsing the CSV files with the spark-csv library can only be done in the "parent"/master job, and not from each child, since Spark doesn't allow job nesting. So that won't work as long as I want to use the Spark libraries to do the importing/parsing.
Mixed-Language
You can, of course, use the language's native CSV parsing to read in each file then "upload" them to Spark. In R, this is a combination of some package to get the file out of S3 followed by a read.csv, and finishing off with a createDataFrame() to get the data into Spark. Unfortunately, this is really slow and also seems backwards to the way I want Spark to work. If all my data is piping through R before it can get into Spark, why bother with Spark?
Hive/Sqoop/Phoenix/Pig/Flume/Flume Ng/s3distcp
I've started looking into these tailored tools and quickly got overwhelmed. My understanding is that many/all of these tools could be used to get my CSV files from S3 into HDFS.
Of course it would be faster to read my CSV files in from HDFS than S3, so that solves some portion of the problem. But I still have tens of thousands of CSVs that I need to parse and am unaware of a distributed way to do that in Spark.
Right now (Spark 1.4) SparkR supports JSON and Parquet file structures. CSV files can be parsed, but then the Spark context needs to be started with an extra jar (which needs to be downloaded and placed in the appropriate folder; I've never done this myself, but my colleagues have).
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
There is more information in the docs. I expect that a newer spark release would have more support for this.
If you don't do this you'll need to either resort to a different file structure or use python to convert all your files from .csv into .parquet. Here is a snippet from a recent python talk that does this.
from pyspark.sql import Row
# Read the raw text lines from S3 (s3_paths and the column names below are placeholders).
data = sc.textFile(s3_paths, 1200).cache()
def caster(x):
    # Turn one split line into a Row; rename the columns to match your schema.
    return Row(colname1=x[0], colname2=x[1])
df_rdd = data \
    .map(lambda x: x.split(',')) \
    .map(caster)
# inferSchema() is the Spark 1.x API; on newer versions use sqlContext.createDataFrame(df_rdd) instead.
ddf = sqlContext.inferSchema(df_rdd).cache()
ddf.write.save('s3n://<bucket>/<filename>.parquet')
Also, how big is your dataset? You may not even need Spark for your analysis. Note also that, as of right now:
SparkR only has DataFrame support;
there is no distributed machine learning yet;
for visualisation you will need to convert a distributed DataFrame back into a normal one if you want to use libraries like ggplot2;
if your dataset is no larger than a few gigabytes, the extra effort of learning Spark might not be worthwhile yet;
it's modest now, but you can expect more in the future.
I've run into this problem before (though with reading a large quantity of Parquet files), and my recommendation would be to avoid DataFrames and use RDDs.
The general idiom used was:
Read in a list of the files, with each file being a line (in the driver). The expected output here is a list of strings.
Parallelize the list of strings and map over them with a custom CSV reader, with the return value being a list of case classes.
You can also use flatMap if, at the end of the day, you want a data structure like List[weather_data] that could be written back out to Parquet or a database (a rough PySpark version of this idiom is sketched below).
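A rough PySpark sketch of that idiom, using local paths for simplicity (for S3 you would list the keys with something like boto3 and open them with an S3 client inside the parser); all names here are illustrative:
import csv
import glob
from pyspark import SparkContext
from pyspark.sql import Row
sc = SparkContext(appName="csv-aggregate")
# 1. Build the list of files in the driver: a plain list of strings.
paths = glob.glob("data/*/*/*.csv")
def parse_file(path):
    # Parse one CSV file; skip (and log) files that fail to parse.
    try:
        with open(path, newline="") as f:
            return [Row(colname1=r[0], colname2=r[1]) for r in csv.reader(f)]
    except Exception as e:
        print("skipping %s: %s" % (path, e))  # visible in the executor logs
        return []
# 2. Parallelize the paths and flatMap a custom reader over them,
#    so each worker parses its own share of the files.
rows = sc.parallelize(paths, 200).flatMap(parse_file)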

Importing thousands of text files into database

I am pretty new to databases and need help. I have n (large) files, each of which contains m (very large) text files of numeric data. What is the best way to import those files into a MySQL database, particularly with regard to the field names?
Usually one would write a script in Perl (or whatever scripting language you prefer, as long as it offers MySQL support) and process the files one after another, applying the necessary processing to each file and to the lines inside it. If you would like a more specific answer, ask a more specific question.
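A minimal Python sketch of that approach, assuming the text files are delimited/tabular, and treating the connection string and directory layout below as placeholders (pandas plus SQLAlchemy with a MySQL driver such as PyMySQL):
import glob
import os
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")  # placeholder credentials
for path in glob.glob("data/**/*.txt", recursive=True):
    table = os.path.splitext(os.path.basename(path))[0]  # here: one table per file, named after it
    df = pd.read_csv(path, sep=None, engine="python")    # sep=None lets pandas sniff the delimiter
    df.to_sql(table, engine, if_exists="append", index=False)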
If you only need to do it once, or if the import process stays fairly similar each time, I would recommend the ETL software from Pentaho commonly referred to as Kettle. While this software is far from perfect, I've found that I can often import data in a fraction of the time it would take me to write a script for one specific file. You can select a text file input, specify the delimiters, fixed widths, etc., and then export directly into your SQL server (MySQL, SQLite, Oracle, and many more are supported).
If you would like to research other tools of this kind, they are often referred to as ETL software, short for Extract, Transform, Load.
If you're familiar with Python, I would also recommend the last post on this page.

Is there any free tool to convert a file with more than 65,000 records from DBF format to CSV?

I need to convert a very large file from DBF format to CSV format. I have tried Microsoft Excel, but the problem is that I cannot see more than 65,500 records when I open and export the file.
Microsoft Access couldn't open the file either.
I have found some shareware tools on Google by searching for "DBF to CSV". Have you tried any of these with very large files?
Also, any solution that could export to a MySQL or PostgreSQL database would be welcome.
Thanks in advance for your responses, best regards,
https://github.com/SocialExplorer/FastDBF
"Also included here is a small utility that reads DBF files and outputs CSV files! "
Go to http://www.the-oasis.net/ftpmaster.php3?content=ftputils.htm and look for the file dbx130.zip (Bytes: 125,478, Date: 1993-03-22):
dbMAX is an xBASE utility that will allow complete multi-user access
to any xBASE databases and indexes. The program uses a CUA-type menu
system with Brief(R)-style hot keys and can browse databases in up to
250 moveable, sizable windows. Almost every Clipper(R)/dBASE(R)
command is available, allowing dbMAX to replace the dBASE
Assist/Control Center or Computer Associates' DBU utility. dbMAX also
has a partially open architecture, allowing programmers to create
their own menus and operate on dbMAX internal data structures.
This utility has a DOS UI, but the Copy function on its menu lets you export entire DBF tables in SDF or CSV format. I personally know it can handle a file with 3.8 million rows, so it should be able to handle your table.
Use OpenOffice - it's free and can handle a lot of rows. With that many rows, you might need to split the file, convert the pieces, and then reassemble them.
OpenOffice 3.0 Calc maxes out at 65K rows. I tried importing a large DBF into OpenOffice 3.0 Base, but it handed the job off to Calc :-(
Alternative: if you have Python 2.4 to 2.6, I can send you a copy of my soon-to-go-public DBF-reading module plus a DBF-to-CSV script. To get my e-mail address, search for "John Machin xlrd" (xlrd is my Excel XLS-reading package).
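If you would rather use something that is already publicly available, here is a rough sketch with the third-party dbfread package (pip install dbfread; the file names are placeholders). It streams the DBF row by row, so the 65K-row spreadsheet limits don't matter:
import csv
from dbfread import DBF
table = DBF("bigfile.dbf")                  # placeholder file name
with open("bigfile.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(table.field_names)
    for record in table:
        writer.writerow(list(record.values()))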