Big (~1 GB) JSON data handling in Tableau

I am working with a large Twitter dataset in the form of a JSON file. When I try to import it into Tableau, the upload fails with an error because of the 128 MB data upload limit.
As a result, I have to shrink the dataset to under 128 MB, which reduces the effectiveness of the analysis.
What is the best way to upload and handle large JSON data in Tableau?
Do I need to use an external tool for it?
Can AWS products be used for this? Please advise!

From what I can find in unofficial documentation online, Tableau does indeed have a 128 MB limit on JSON file size. You have several options:
Split the JSON file into multiple files and union them in your data source (https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_json.html#Union)
Use a tool to convert the JSON to CSV or Excel (search for a JSON-to-CSV converter)
Load the JSON into a database such as MySQL and use MySQL as the data source
You may also want to consider posting in the Ideas section of the Tableau Community pages, suggesting support for larger JSON files. This will bring it to the attention of the broader Tableau community and product management.
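For the first option (splitting the file), here is a minimal Node.js sketch, assuming the tweets are stored as newline-delimited JSON (one object per line); the input file name and the 120 MB threshold are just placeholders. The resulting part files can then be unioned in Tableau as described in the link above.

```javascript
// Minimal sketch: split a newline-delimited JSON file (one tweet per line)
// into parts that stay safely under Tableau's 128 MB limit.
// "tweets.ndjson" is a hypothetical input file name.
const fs = require('fs');
const readline = require('readline');

const LIMIT = 120 * 1024 * 1024; // stay a little under 128 MB per part

async function split(inputPath) {
  const rl = readline.createInterface({
    input: fs.createReadStream(inputPath),
    crlfDelay: Infinity,
  });

  let part = 0;
  let written = 0;
  let out = fs.createWriteStream(`part-${part}.json`);

  for await (const line of rl) {
    // Start a new part file before the current one would exceed the limit.
    if (written + Buffer.byteLength(line) > LIMIT) {
      out.end();
      part += 1;
      written = 0;
      out = fs.createWriteStream(`part-${part}.json`);
    }
    out.write(line + '\n');
    written += Buffer.byteLength(line) + 1;
  }
  out.end();
}

split('tweets.ndjson').catch(console.error);
```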

Related

Convert thousands of small JSON files from S3 to one big CSV in Lambda

I am trying to merge multiple small JSON files (about 500,000 files of 400-500 bytes each, which are no longer subject to change) into one big CSV file using AWS Lambda. I have a job that works something like this:
Use s3.listObjects() to fetch the keys
Use s3.getObject() to fetch each JSON file (is there a better way to do this?)
Create a CSV file in memory (what's the best way to do this in Node.js?)
Upload that file to S3
I'd love to know if there's a better way to go about doing this. Thanks!
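For concreteness, a minimal sketch of that job, assuming AWS SDK v3 for Node.js and placeholder bucket, prefix, and field names (Body.transformToString() also needs a reasonably recent SDK version), might look like this:

```javascript
// Sketch of the job described above. Note that listing and fetching ~500,000
// objects one at a time is slow and may well exceed the Lambda timeout.
const { S3Client, ListObjectsV2Command, GetObjectCommand, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' });

async function mergeToCsv() {
  const rows = ['id,user,text']; // CSV header; adjust to your JSON fields
  let ContinuationToken;

  do {
    // 1. List keys (paginated, up to 1000 per call)
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: 'my-source-bucket',
      Prefix: 'json/',
      ContinuationToken,
    }));

    // 2. Fetch each JSON object and flatten it to one CSV row
    for (const obj of page.Contents ?? []) {
      const res = await s3.send(new GetObjectCommand({ Bucket: 'my-source-bucket', Key: obj.Key }));
      const doc = JSON.parse(await res.Body.transformToString());
      // Crude quoting via JSON.stringify; use a real CSV library for production data.
      rows.push([doc.id, doc.user, JSON.stringify(doc.text)].join(','));
    }

    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);

  // 3. Upload the in-memory CSV to S3
  await s3.send(new PutObjectCommand({
    Bucket: 'my-dest-bucket',
    Key: 'merged.csv',
    Body: rows.join('\n'),
    ContentType: 'text/csv',
  }));
}

mergeToCsv().catch(console.error);
```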
I would recommend using Amazon Athena.
It allows you to run SQL queries across multiple data files at once (including JSON) and can create output files by Creating a Table from Query Results (CTAS) - Amazon Athena.
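As a sketch of how that might be kicked off from Node.js, assuming the @aws-sdk/client-athena package and that an Athena table (here hypothetically named tweets_json, e.g. created by a Glue crawler) is already defined over the JSON files; all names, columns, and S3 locations below are placeholders:

```javascript
// Sketch: run an Athena CTAS query from Node.js to write the merged data out
// as delimited text (CSV-like). Table, column, and bucket names are placeholders.
const { AthenaClient, StartQueryExecutionCommand } = require('@aws-sdk/client-athena');

const athena = new AthenaClient({ region: 'us-east-1' });

const ctas = `
  CREATE TABLE merged_csv
  WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    external_location = 's3://my-dest-bucket/merged/'  -- must be an empty location
  ) AS
  SELECT id, screen_name, text   -- adjust to your JSON fields
  FROM tweets_json
`;

async function run() {
  const { QueryExecutionId } = await athena.send(new StartQueryExecutionCommand({
    QueryString: ctas,
    QueryExecutionContext: { Database: 'my_database' },
    ResultConfiguration: { OutputLocation: 's3://my-query-results-bucket/' },
  }));
  console.log('started CTAS query', QueryExecutionId);
}

run().catch(console.error);
```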

Express/Node: sending an uploaded file to the DB or the file system

I am new to the Express/Node environment and not familiar with the APIs and functionality it provides. We have Express 4 in our project and need to add a feature with a couple of file upload buttons.
We are thinking of storing the files in a DB (SQL Server) table instead of on the file system.
I experimented with some examples and was able to upload a file to the file system (using the express-file-upload module).
Now I want to try the DB table approach, which is the way our team prefers, and I want to know the best way to do it for our needs.
The options I see are:
- busboy module
- multer
- simple file/path modules to open/read the files and insert them with queries (I am trying this method but don't know if it will work)
Please suggest the right approach.
With the same file upload button, the file will now be stored in a table. The table columns are:
file id
blob
creation time
user id
Any ideas or suggestions?
Thanks
I would recommend using multer, as it abstracts away much of the complexity of file uploads.
You can certainly store the file blob in your database, but I wouldn't do it. It will make your database very large, and backups could take a long time. See this Stack Exchange answer.
I highly recommend storing files on the server itself or, better, using S3. With S3 you get all the benefits of AWS at a very low price. You then store the S3 key in your database.
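A minimal sketch of that approach, assuming Express 4, the multer and @aws-sdk/client-s3 packages, and a hypothetical bucket name; the SQL Server insert is only stubbed out, since the exact client (e.g. the mssql package) is up to you:

```javascript
// Sketch: accept an upload with multer, put the file content in S3,
// and store only the S3 key plus metadata in the database.
const express = require('express');
const multer = require('multer');
const crypto = require('crypto');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const app = express();
const upload = multer({ storage: multer.memoryStorage() }); // file arrives in req.file.buffer
const s3 = new S3Client({ region: 'us-east-1' });

// Placeholder for the real SQL Server insert, e.g.
// INSERT INTO files (s3_key, creation_time, user_id) VALUES (...)
async function saveFileRecord(s3Key, userId) {
  console.log('would insert row:', { s3Key, userId, creationTime: new Date() });
}

app.post('/upload', upload.single('file'), async (req, res) => {
  try {
    const key = `uploads/${crypto.randomUUID()}-${req.file.originalname}`;

    // The file content goes to S3 rather than into a blob column.
    await s3.send(new PutObjectCommand({
      Bucket: 'my-upload-bucket', // hypothetical bucket name
      Key: key,
      Body: req.file.buffer,
      ContentType: req.file.mimetype,
    }));

    await saveFileRecord(key, 'some-user-id'); // the user id would come from your auth layer
    res.json({ key });
  } catch (err) {
    res.status(500).send(err.message);
  }
});

app.listen(3000);
```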

Merits of JSON vs CSV file format while writing to HDFS for downstream applications

We are in the process of extracting source data (xls) and ingesting it into HDFS. Is it better to write these files in CSV or JSON format? We are contemplating choosing one of them, but before making the call we would like to know the merits and demerits of each.
The factors we are trying to weigh are:
Performance (data volume is 2-5 GB)
Loading vs. reading the data
How easy it is to extract metadata (structure) information from either of these formats
The ingested data will be consumed by other applications that support both JSON and CSV.
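To make the metadata factor concrete, here is a small illustration (Node.js purely for exposition; the record and its field names are made up) of how the same record carries its structure in each format:

```javascript
// The same record serialized as JSON and as CSV.
const record = { user: 'alice', lat: 43.7, lon: -79.4, ts: '2018-01-01T00:00:00Z' };

// JSON is self-describing: field names (and any nesting) travel with every record,
// which makes schema/metadata extraction easy but inflates the file size.
const asJson = JSON.stringify(record);
// => {"user":"alice","lat":43.7,"lon":-79.4,"ts":"2018-01-01T00:00:00Z"}

// CSV carries structure only in a single header row (or an external schema),
// which keeps files smaller but pushes type/structure information downstream.
const header = Object.keys(record).join(',');
const asCsv = header + '\n' + Object.values(record).join(',');
// => user,lat,lon,ts
//    alice,43.7,-79.4,2018-01-01T00:00:00Z
```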

Ingesting MySQL data to GeoMesa analytics

I am new to GeoMesa; I have only just started using the geomesa command. After following the command-line tools tutorial on the GeoMesa website, I found some information on ingesting data into GeoMesa through a .csv file.
So, for my research:
I have a MySQL database storing all the information sent from an Android application.
I want to perform some geospatial analytics on it.
Right now I am converting my MySQL table to a .csv file and then ingesting it into GeoMesa, as advised on the GeoMesa website.
But my questions are:
Is there a better option? The data is several GB in size and it is streaming data, so I have to generate the .csv file regularly.
Is there any API through which I can connect my MySQL database to GeoMesa?
Is there any way to ingest using a .sql dump file? That would be much easier than a .csv file.
Since you are dealing with streaming data, I'd point to two GeoMesa integrations (plus one further option):
First, you might want to check out NiFi for managing data flows. If that fits into your architecture, you can use GeoMesa with NiFi.
Second, Storm is quite popular for working with streaming data. GeoMesa has a brief tutorial for Storm here.
Third, to ingest SQL dumps directly, one option would be to extend the GeoMesa converter library to support them. So far, we haven't had that as a feature request from a customer or a contribution to the project. It would definitely be a sensible and welcome extension!
I'd also point out the GeoMesa Gitter channel; it can be useful for getting quicker responses.

Where to find GTFS realtime file

I have been doing extensive research on GTFS and GTFS-Realtime. All I want to be able to do is find out how late a certain bus is. I can't seem to find where I can connect to in order to properly search for a specific bus number. So my questions are:
Where/how can I find the GTFS-Realtime feed?
How can I properly open the file and make it location-specific?
I've been trying to use http://www.yrt.ca/en/aboutus/GTFS.asp to download the file, but can't figure out how to open the CSV file properly.
According to What is GTFS-realtime?, the GTFS-realtime data is not in CSV format. Instead, it is based on Protocol Buffers:
Data format
The GTFS-realtime data exchange format is based on Protocol Buffers.
Protocol buffers are a language- and platform-neutral mechanism for serializing structured data (think XML, but smaller, faster, and simpler). The data structure is defined in a gtfs-realtime.proto file, which then is used to generate source code to easily read and write your structured data from and to a variety of data streams, using a variety of languages – e.g. Java, C++ or Python.
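For example, a minimal Node.js sketch of decoding a GTFS-realtime TripUpdates feed and printing delays, assuming the gtfs-realtime-bindings npm package and a placeholder feed URL (each agency publishes its own endpoint; the YRT one would be listed on the page you linked):

```javascript
// Sketch: decode a GTFS-realtime TripUpdates feed (a Protocol Buffers
// FeedMessage, not CSV) and print how late each stop is predicted to be.
// Assumes Node.js 18+ (global fetch) and the gtfs-realtime-bindings package.
const GtfsRealtimeBindings = require('gtfs-realtime-bindings');

const FEED_URL = 'https://example.com/gtfs-realtime/TripUpdates'; // placeholder

async function showDelays() {
  const response = await fetch(FEED_URL);
  const buffer = new Uint8Array(await response.arrayBuffer());

  const feed = GtfsRealtimeBindings.transit_realtime.FeedMessage.decode(buffer);

  for (const entity of feed.entity) {
    if (!entity.tripUpdate) continue;
    const routeId = entity.tripUpdate.trip.routeId;
    for (const stu of entity.tripUpdate.stopTimeUpdate ?? []) {
      const delay = stu.arrival?.delay ?? stu.departure?.delay;
      if (delay != null) {
        console.log(`route ${routeId}, stop ${stu.stopId}: ${delay} seconds late`);
      }
    }
  }
}

showDelays().catch(console.error);
```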