Merits of JSON vs CSV file format while writing to HDFS for downstream applications

We are in the process of extracting source data (xls) and ingesting it into HDFS. Is it better to write these files in CSV or JSON format? We are contemplating choosing one of them, but before making the call we would like to understand the merits and demerits of each.
Factors we are trying to figure out are:
Performance (Data Volume is 2-5 GB)
Loading vs Reading Data
How easy it is to extract metadata (structure) information from either of these files.
The ingested data will be consumed by other applications, which support both JSON and CSV.
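To make the size and metadata trade-off concrete, here is a minimal Python sketch (the records and field names are made up for illustration) that writes the same rows as CSV and as JSON Lines. CSV stores the field names once in a header row; JSON repeats them in every record, which is why JSON files are typically larger per row but self-describing.

```python
import csv
import io
import json

records = [{"id": 1, "name": "alice", "score": 9.5},
           {"id": 2, "name": "bob", "score": 7.0}]

# CSV: compact, but the structure lives only in the single header row
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=records[0].keys())
writer.writeheader()
writer.writerows(records)

# JSON Lines: each record carries its own field names (and value types)
jsonl = "\n".join(json.dumps(r) for r in records)

# JSON repeats "id"/"name"/"score" in every record, so it is larger here,
# but a reader can recover the structure from any single line.
```

The same trade-off scales with volume: at 2-5 GB the CSV will load faster and be smaller, while JSON makes schema discovery by downstream applications easier.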

Related

Qlikview and Qliksense VS MSBI

This question may seem very stupid, but I'm actually struggling to get it clear in my head.
I have some academic experience with SSIS, SSAS and SSRS.
In simple terms:
SSIS - integrates data from a data source into a data destination;
SSAS - builds a cube of data, which lets you analyze and explore the data;
SSRS - lets you create reports and dashboards with charts, etc. from the data source.
Now, doing a comparison with Qlikview and Qliksense...
Can the Qlik products do exactly the same as SSIS, SSAS and SSRS? That is, can Qlik products do the extraction (SSIS), data processing (SSAS) and data visualization (SSRS)? Or do they just work more on the SSRS side (creating dashboards from the data sources)? Do the Qlik tools cover the ETL stages (extract, transform and load)?
I'm really struggling here, even after reading tons of information about it, so any clarification helps a lot!
Thanks,
Anna
Yes. Qlik (View and Sense) can be used as an ETL tool and as a presentation layer. Each single file (qvw for View, qvf for Sense) contains the script used for ETL (load all the required data from all data sources and transform it if needed), the actual data, and the visuals.
Depending on the complexity, one file can be used for everything, but the process can be organised across multiple files as well (if needed). For example:
Extract - contains the script for data extract (eventually with incremental load implemented if the data volumes are big) and stores the data in qvd files
Transform - loads the qvd files from the extraction process (qvd load is quite fast) and performs the required transformations
Load - loads the data model from the transformation file (binary load) and creates the visualisations
Another example of multiple files: I had a project which required multiple extractor and multiple transformation files. Because the data was extracted from multiple data sources, to speed up the process we ran all the extractor files at the same time, then all the transformation files at the same time, then the main transform (which combined all the qvd files into a single data model).
In addition to the previous comment have a look at the layered Qlik architecture.
There it is described quite well how you should structure your files.
However, I would not recommend using Qlik for a full-blown data warehouse (which you could build easily with SSIS), as it lacks some useful functions (e.g. helpers for slowly changing dimensions).

Big (>1GB) JSON data handling in Tableau

I am working with a large Twitter dataset in the form of a JSON file. When I try to import it into Tableau, there is an error and the upload fails on account of the 128 MB data upload limit.
Because of this I need to shrink the dataset to bring it under 128 MB, thereby reducing the effectiveness of the analysis.
What is the best way to upload and handle large JSON data in Tableau?
Do I need to use an external tool for it?
Can we use AWS products to handle the same? Please advise!
From what I can find in unofficial documents online, Tableau does indeed have a 128 MB limit on JSON file size. You have several options.
Split the JSON files into multiple files and union them in your data source (https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_json.html#Union)
Use a tool to convert the JSON to CSV or Excel (Google for a JSON-to-CSV converter)
Load the JSON into a database, such as MySQL, and use MySQL as the data source
You may want to consider posting in the Ideas section of the Tableau Community pages and add a suggestion for allowing larger JSON files. This will bring it to the attention of the broader Tableau community and product management.

Efficiently Aggregate Many CSVs in Spark

Pardon my simple question but I'm relatively new to Spark/Hadoop.
I'm trying to load a bunch of small CSV files into Apache Spark. They're currently stored in S3, but I can download them locally if that simplifies things. My goal is to do this as efficiently as possible. It seems like it would be a shame to have some single-threaded master downloading and parsing a bunch of CSV files while my dozens of Spark workers sit idly. I'm hoping there's an idiomatic way to distribute this work.
The CSV files are arranged in a directory structure that looks like:
2014/01-01/fileabcd.csv
2014/01-01/filedefg.csv
...
I have two years of data, with directories for each day, and a few hundred CSVs inside each. All of those CSVs should have an identical schema, but it's of course possible that one CSV is awry, and I'd hate for the whole job to crash if there are a couple of problematic files. Those files can be skipped, as long as I'm notified in a log somewhere that it happened.
It seems that every Spark project I have in mind is in this same form and I don't know how to solve it. (e.g. trying to read in a bunch of tab-delimited weather data, or reading in a bunch of log files to look at those.)
What I've Tried
I've tried both SparkR and the Scala libraries. I don't really care which language I need to use; I'm more interested in the correct idioms/tools to use.
Pure Scala
My original thought was to enumerate and parallelize the list of all year/mm-dd combinations so that I could have my Spark workers all processing each day independently (download and parse all CSV files, then stack them on top of each other (unionAll()) to reduce them). Unfortunately, downloading and parsing the CSV files using the spark-csv library can only be done in the "parent"/master job, not from each child, as Spark doesn't allow job nesting. So that won't work as long as I want to use the Spark libraries to do the importing/parsing.
Mixed-Language
You can, of course, use the language's native CSV parsing to read in each file then "upload" them to Spark. In R, this is a combination of some package to get the file out of S3 followed by a read.csv, and finishing off with a createDataFrame() to get the data into Spark. Unfortunately, this is really slow and also seems backwards to the way I want Spark to work. If all my data is piping through R before it can get into Spark, why bother with Spark?
Hive/Sqoop/Phoenix/Pig/Flume/Flume Ng/s3distcp
I've started looking into these tailored tools and quickly got overwhelmed. My understanding is that many/all of these tools could be used to get my CSV files from S3 into HDFS.
Of course it would be faster to read my CSV files in from HDFS than S3, so that solves some portion of the problem. But I still have tens of thousands of CSVs that I need to parse and am unaware of a distributed way to do that in Spark.
So right now (Spark 1.4) SparkR has support for JSON and Parquet file structures. CSV files can be parsed, but then the Spark context needs to be started with an extra jar (which needs to be downloaded and placed in the appropriate folder; I've never done this myself, but my colleagues have).
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
There is more information in the docs. I expect that a newer Spark release will have more support for this.
If you don't do this, you'll need to either resort to a different file structure or use Python to convert all your files from .csv to .parquet. Here is a snippet from a recent Python talk that does this:
from pyspark.sql import Row

data = sc.textFile(s3_paths, 1200).cache()

def caster(x):
    return Row(colname1=x[0], colname2=x[1])

df_rdd = data \
    .map(lambda x: x.split(',')) \
    .map(caster)

ddf = sqlContext.inferSchema(df_rdd).cache()
ddf.write.save('s3n://<bucket>/<filename>.parquet')
Also, how big is your dataset? You may not even need Spark for analysis. Note also that, as of right now:
SparkR has only DataFrame support;
there is no distributed machine learning yet;
for visualisation you will need to convert a distributed data frame back into a normal one if you want to use libraries like ggplot2;
if your dataset is no larger than a few gigabytes, then the extra bother of learning Spark might not be worthwhile yet;
it's modest now, but you can expect more in the future.
I've run into this problem before (but with reading a large quantity of Parquet files) and my recommendation would be to avoid dataframes and use RDDs.
The general idiom used was:
Read in a list of the files, with each file being a line (in the driver). The expected output here is a list of strings.
Parallelize the list of strings and map over them with a custom CSV reader, with the return being a list of case classes.
You can also use flatMap if at the end of the day you want a data structure like List[weather_data] that could be rewritten to Parquet or a database.
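A rough Python sketch of that idiom (the Scala version would use case classes instead of tuples): the function below is a hypothetical custom CSV reader that logs and skips a malformed file instead of crashing the job, which also covers the "notify me in a log" requirement from the question. The commented lines show roughly how it would be wired into Spark.

```python
import csv
import io
import logging

def parse_csv_text(name, text, expected_cols=3):
    """Parse the contents of one CSV file into a list of row tuples.
    On any parse error, log a warning and return [] so the file is
    skipped without failing the whole job."""
    try:
        rows = []
        for row in csv.reader(io.StringIO(text)):
            if len(row) != expected_cols:
                raise ValueError("bad column count: %r" % (row,))
            rows.append(tuple(row))
        return rows
    except Exception as exc:
        logging.warning("skipping %s: %s", name, exc)
        return []

# In Spark this would be used roughly as (sc is an existing SparkContext;
# wholeTextFiles yields (path, contents) pairs per file):
# records = sc.wholeTextFiles("s3n://bucket/2014/*/*.csv") \
#             .flatMap(lambda kv: parse_csv_text(kv[0], kv[1]))
```

Because parsing happens inside flatMap, the work is spread across the workers rather than funneled through the driver.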

Where to find GTFS realtime file

I have been doing extensive research on GTFS and GTFS-Realtime. All I want to be able to do, is find out how late a certain bus would be. I can't seem to find where I can connect to, to properly search for a specific bus number. So my questions are:
Where/how can I find the GTFS-Realtime file feed?
How can I properly open the file and make it location-specific?
I've been trying to use http://www.yrt.ca/en/aboutus/GTFS.asp to download the file, but can't figure out how to open the csv file properly.
According to What is GTFS-realtime?, the GTFS-realtime data is not in CSV format. Instead, it is based on Protocol Buffers:
Data format
The GTFS-realtime data exchange format is based on Protocol Buffers.
Protocol buffers are a language- and platform-neutral mechanism for serializing structured data (think XML, but smaller, faster, and simpler). The data structure is defined in a gtfs-realtime.proto file, which then is used to generate source code to easily read and write your structured data from and to a variety of data streams, using a variety of languages – e.g. Java, C++ or Python.
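As a sketch of what consuming such a feed looks like in Python: the feed URL below is a placeholder (the real endpoint has to come from the agency's developer page), and the fetch step assumes the third-party gtfs-realtime-bindings package, which provides the generated gtfs_realtime_pb2 module.

```python
import urllib.request

# Placeholder URL -- substitute the agency's published TripUpdates endpoint.
FEED_URL = "https://example.com/gtfs-realtime/TripUpdates.pb"

def fetch_feed(url=FEED_URL):
    """Fetch and parse a GTFS-realtime feed.
    Requires: pip install gtfs-realtime-bindings"""
    from google.transit import gtfs_realtime_pb2
    feed = gtfs_realtime_pb2.FeedMessage()
    with urllib.request.urlopen(url) as resp:
        feed.ParseFromString(resp.read())
    return feed

def delays_by_route(feed, route_id):
    """Collect arrival delays (in seconds) for one route from a parsed
    TripUpdates feed -- this is how you'd check how late a bus is."""
    delays = []
    for entity in feed.entity:
        if entity.HasField("trip_update") and \
           entity.trip_update.trip.route_id == route_id:
            for stu in entity.trip_update.stop_time_update:
                if stu.HasField("arrival"):
                    delays.append(stu.arrival.delay)
    return delays
```

The static CSV files in the GTFS zip (routes.txt, trips.txt, etc.) are only the schedule; the realtime delays live in the protobuf feed that this code reads.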

Loading large CSV file to MySQL with Django and transformations

I have a large CSV file (5.4GB) of data. It's a table with 6 columns and a lot of rows. I want to import it into MySQL across several tables. Additionally, I have to do some transformations to the data before the import (e.g. parse a cell and insert the parts into several table values, etc.). I could write a script that does the transformation and inserts one row at a time, but that would take weeks to import the data. I know there is LOAD DATA INFILE for MySQL, but I am not sure how, or if, I can do the needed transformations in SQL.
Any advice how to proceed?
In my limited experience you won't want to use the Django ORM for something like this; it will be far too slow. I would write a Python script that operates on the CSV file using Python's csv library, and then use the native MySQL facility LOAD DATA INFILE to load the massaged data.
If the Python script to massage the CSV file is too slow you may consider writing that part in C or C++, assuming you can find a decent CSV library for those languages.