How can AWS Glue process large JSON input files?

I am building an automated ETL pipeline using AWS services, with AWS Glue handling the data transformation.
When I pass a 1 MB JSON file, the job transforms the data successfully and produces output in the required JSON format.
I have researched how AWS Glue processes larger files (around 2 GB) but couldn't find a clear answer. Could you please share any ideas or references on this?
I am using a custom PySpark script for the transformation.
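A minimal sketch (the bucket paths below are placeholders) of how such a Glue job can read a large JSON dataset from S3 through a DynamicFrame, which lets Spark distribute the work across executors, assuming the input is newline-delimited JSON (a single 2 GB JSON document is not splittable and would land on one executor):

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read JSON from S3; "s3://my-bucket/input/" is a placeholder prefix.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/input/"], "recurse": True},
        format="json",
    )

    # ... custom PySpark transformation logic goes here ...

    # Write the transformed data back to S3 as JSON.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/output/"},
        format="json",
    )

    job.commit()

Whether this scales to 2 GB depends mostly on the DPU capacity assigned to the job and on the input being splittable, not on the script itself.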

Related

Json Format for azure cosmos graph db bulk import

We are planning to migrate our database to the Azure Cosmos DB graph API and are using this bulk import tool, but the JSON input format is not documented anywhere.
What is the JSON format for bulk import into Azure Cosmos DB graph?
https://github.com/Azure-Samples/azure-cosmosdb-graph-bulkexecutor-dotnet-getting-started
Appreciate any help.
You actually don't need to build the gremlin queries to insert your edges. In CosmosDB, everything is regarded as a JSON document (even the vertices and edges in a graph collection).
The format of the required JSON isn't officially published and can change at any time, but it can be discovered through inspection of the SDKs.
I wrote about it here a while ago and it is still valid today.
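One hedged way to do that inspection yourself (the account endpoint, key, database and container names below are placeholders): a graph collection is an ordinary document collection underneath, so any vertex or edge created through Gremlin can be read back through the SQL (Core) API to see the JSON shape it is stored in.

    from azure.cosmos import CosmosClient

    # Placeholders: substitute your own endpoint, key, database and graph names.
    client = CosmosClient("https://my-account.documents.azure.com:443/", "my-key")
    container = client.get_database_client("graphdb").get_container_client("people")

    # Every vertex and edge comes back as a plain JSON document, revealing the
    # shape the graph API stores them in.
    for doc in container.query_items(
        query="SELECT * FROM c", enable_cross_partition_query=True
    ):
        print(doc)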

Importing more than 1 glue catalog tables into a redshift table

I have more than one file in S3 that I would like to import into Redshift. The COPY command was giving me incomprehensible errors, so I used an AWS Glue crawler to get the files into my Glue catalog. I then created a connection for Redshift and used a Glue job to ingest the data into Redshift.
I was able to get the data from a file clicks_001.json in S3 into a Redshift table clicks. That worked. But the problem is that I have thousands of such files and I want to get all of them into the same Redshift table.
I tried passing parameters to the job but was not able to read them from the job arguments. I was thinking that I could use the SDK to start the job one by one for each table in the catalog. I think it is a bug in AWS Glue and I logged it at https://forums.aws.amazon.com/thread.jspa?threadID=272398&tstart=0.
I understand that AWS Glue is a wrapper on top of Spark. In Spark we can read files with a pattern like s3://files-dir/my_file-*.json, but I could not find a way to read data like that in Glue. Any suggestions on how to load multiple S3 files into Redshift?
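One possible approach (the catalog database, table, connection and staging-path names below are placeholders): point the crawler at the S3 prefix rather than a single object, so the resulting catalog table spans every clicks_*.json file, and then let a single Glue job write the whole table to Redshift.

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # The catalog table (built by crawling the whole prefix) already covers
    # every clicks_*.json object under it.
    clicks = glue_context.create_dynamic_frame.from_catalog(
        database="clicks_db",   # placeholder catalog database
        table_name="clicks",    # placeholder catalog table
    )

    # Push everything into a single Redshift table through the Glue connection.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=clicks,
        catalog_connection="redshift-conn",   # placeholder connection name
        connection_options={"dbtable": "clicks", "database": "analytics"},
        redshift_tmp_dir="s3://my-bucket/glue-tmp/",   # placeholder staging path
    )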

Big (>1 GB) JSON data handling in Tableau

I am working with a large Twitter dataset in the form of a JSON file. When I try to import it into Tableau, the upload fails with an error because of the 128 MB data upload limit.
Because of this I have to shrink the dataset to get it under 128 MB, which reduces the effectiveness of the analysis.
What is the best way to upload and handle large JSON data in Tableau?
Do I need to use an external tool for it?
Can AWS products be used to handle this? Please advise!
From what I can find in unofficial documents online, Tableau does indeed have a 128 MB limit on JSON file size. You have several options.
Split the JSON file into multiple files and union them in your data source (https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_json.html#Union); a sketch of the splitting step follows below.
Use a tool to convert the JSON to CSV or Excel (search for a JSON to CSV converter).
Load the JSON into a database, such as MySQL, and use MySQL as the data source.
You may also want to consider posting in the Ideas section of the Tableau Community pages and suggesting support for larger JSON files. This will bring it to the attention of the broader Tableau community and product management.
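For the first option, a rough sketch, assuming the Twitter data is newline-delimited JSON (one tweet per line) and that tweets.json and the output names are placeholders: it splits the file into pieces below Tableau's 128 MB limit so they can be unioned.

    LIMIT = 120 * 1024 * 1024  # stay a little under the 128 MB limit

    part, size, out = 0, 0, None
    with open("tweets.json", "r", encoding="utf-8") as src:
        for line in src:
            line_bytes = len(line.encode("utf-8"))
            # Start a new output file whenever the current one would exceed the limit.
            if out is None or size + line_bytes > LIMIT:
                if out is not None:
                    out.close()
                part += 1
                size = 0
                out = open(f"tweets_part_{part:03d}.json", "w", encoding="utf-8")
            out.write(line)
            size += line_bytes
    if out is not None:
        out.close()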

Merits of JSON vs CSV file format while writing to HDFS for downstream applications

We are in the process of extracting source data (XLS) and ingesting it into HDFS. Is it better to write these files in CSV or JSON format? We are contemplating choosing one of them, but before making the call we would like to know the merits and demerits of each.
The factors we are trying to weigh are:
Performance (data volume is 2-5 GB)
Loading vs. reading data
How easy it is to extract metadata (structure) from either format (see the sketch below)
The ingested data will be consumed by other applications that support both JSON and CSV.
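As a quick way to compare the metadata point, a Spark sketch (the HDFS paths are placeholders) that reads the same data in both formats and prints the inferred schema: JSON carries field names (and nesting) in every record, while CSV only has column names if a header row exists and needs an extra inferSchema pass to get types.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-comparison").getOrCreate()

    # JSON: field names and nested structure travel with every record,
    # so the schema can be inferred directly from the data.
    json_df = spark.read.json("hdfs:///data/sample_json/")   # placeholder path
    json_df.printSchema()

    # CSV: column names exist only if a header row is present, and every value
    # stays a string unless inferSchema (an extra pass over the data) is enabled.
    csv_df = spark.read.csv(
        "hdfs:///data/sample_csv/",                           # placeholder path
        header=True,
        inferSchema=True,
    )
    csv_df.printSchema()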

Fastest way to download/access files in Amazon S3

I'm trying to access files in my Amazon S3 bucket and do some operations on them. I am currently evaluating the options.
Since I will be doing some operations on the S3 files, I would prefer to use a programming language to access them (I have already tried the copy command).
My bucket contains JSON files ranging from 2 MB to 4 MB, and I need to parse these JSON files and load them into a database (I'm thinking about using jQuery here, but any other suggestions are welcome).
Given these requirements, which language/platform would be the most efficient to use here?
Your options are pretty broad here. AWS has a list of SDKs for you to choose from: https://aws.amazon.com/tools/#sdk
So your comfort level with a particular language should be your largest influencer. Given that you mentioned JSON and jQuery, perhaps you should look at the Node.js SDK and AWS Lambda.
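If Python ends up being the more comfortable choice, a minimal boto3 sketch (bucket and prefix names are placeholders) that lists the JSON objects and parses each one in memory, which is fine at 2-4 MB per file; the same list-then-get pattern maps directly onto the Node.js SDK.

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"     # placeholder bucket name
    PREFIX = "incoming/"     # placeholder key prefix

    # Page through the bucket and parse each JSON object.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            record = json.loads(body)
            # ... insert `record` into your database here ...
            print(obj["Key"])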