Copy and unzip from S3 to HDFS - json

I have a few large zip files on S3. Each of these zip files contains several gz files, which contain data in JSON format. I need to (i) copy the gz files to HDFS and (ii) process the files, preferably with Apache Spark/Impala/Hive. What is the easiest/best way to go about it?

1) Try distcp for copying the files from S3 to HDFS.
2) For processing, use "org.apache.spark.sql.hive.HiveContext"'s read.json to read the JSON data from HDFS into a DataFrame, then run whatever operations you need on it.
Follow this link:
http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes
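A minimal PySpark sketch of step 2, assuming an existing SparkContext sc; the HDFS path and table name are hypothetical, and Spark decompresses .gz files transparently when reading JSON:
# Minimal sketch of step 2; the HDFS path and table name are hypothetical.
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc: an existing SparkContext

df = sqlContext.read.json("hdfs:///data/json/*.gz")  # .gz is decompressed transparently
df.printSchema()
df.registerTempTable("events")  # query the DataFrame with SQL if you prefer
sqlContext.sql("SELECT COUNT(*) FROM events").show()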

Related

Convert file format while copying data using copy activity in Azure Data Factory

I am performing a copy activity to bring data into Azure Data Lake using Azure Data Factory. The files are in compressed (.gz) format.
I want to copy those files but change the format to .json instead of keeping the original format (each .gz file contains a .json file inside).
Is there a mechanism to do this in Azure Data Factory? I need this because further ETL steps will have issues with the .gz format.
Any help would be great. Thank you.
Step 1: Create a Copy activity.
Step 2: Select the .gz file as the Source.
Step 3: On the source dataset, select gzip (.gz) as the Compression type and Optimal as the Compression level.
Step 4: Select blob storage as the Sink and run the pipeline.
This will unzip your .gz file during the copy.
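If you want to do (or verify) the decompression yourself outside Data Factory, for example from a notebook, here is a minimal sketch with the azure-storage-blob SDK; the connection string, container, and blob names are assumptions:
# Hypothetical container/blob names: download a .gz blob, decompress it,
# and upload the plain .json back to blob storage.
import gzip
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("mycontainer")

compressed = container.get_blob_client("raw/data.json.gz").download_blob().readall()
container.upload_blob(name="staged/data.json", data=gzip.decompress(compressed), overwrite=True)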

Ingest multiple Excel files to MySQL using a query

I am trying to load data from excel files into a table in MySql. There are 400 excel files in .xlsx format.
I have successfully ingested one file into the table, but that involves manually converting the Excel file into a CSV file, saving it to a location, and then running a query to load it using LOAD DATA LOCAL INFILE. How do I do it for the rest of the files?
How can I load all 400 .xlsx files in a folder without converting them manually to .csv files and running the ingestion query on them one by one? Is there a way in MySQL to do that, for example a loop that goes through all the files and ingests them into the table?
Try bulk converting your XLSXs into CSVs using in2csv as found in csvkit.
## single file
in2csv file.xlsx > file.csv
## multiple files
for file in *.xlsx; do in2csv "$file" > "${file%.xlsx}.csv"; done
Then import the data into MySQL using LOAD DATA LOCAL INFILE...
For loading multiple CSVs, loop the same way (for file in *.csv; do ...) or see How to Import Multiple csv files into a MySQL Database; a Python sketch of the loop is below.
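A minimal Python sketch of that loop, assuming pymysql with LOCAL INFILE enabled on both client and server; the connection details, folder, and table name are hypothetical:
# Hypothetical connection details and table name: loop over the converted CSVs
# and load each one with LOAD DATA LOCAL INFILE.
import glob
import pymysql

conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="mydb", local_infile=True)

with conn.cursor() as cur:
    for path in glob.glob("/data/excel_exports/*.csv"):
        # in2csv writes a header row, hence IGNORE 1 LINES
        cur.execute(
            "LOAD DATA LOCAL INFILE %s INTO TABLE my_table "
            "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
            "LINES TERMINATED BY '\\n' IGNORE 1 LINES",
            (path,),
        )
conn.commit()
conn.close()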

Can we merge .CSV and .RAR files in Hive (Hadoop tools)?

Can you suggest how we can merge different types of files?
Merging different types of files directly cannot be accomplished; each file type has its own way of compressing and storing data.
RAR files, moreover, are not usually used in Hadoop at all. Other formats such as Parquet, ORC, and JSON can be merged by first converting the files to the same type.
For example, if the requirement is to merge Parquet and JSON files, the Parquet files can be converted to JSON using tools like parquet-tools.jar, and the files can then be merged by loading them into a table with the appropriate schema (a Spark sketch of the same idea is below).
Hope this helps!
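A minimal PySpark sketch of the "convert to a common type, then merge" idea for CSV and Parquet inputs; the paths and output format are hypothetical, and a RAR archive would still have to be unpacked first with an external tool:
# Hypothetical paths: read both sources into DataFrames, align the schema,
# union them, and write everything out in one common format.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-mixed-formats").getOrCreate()

csv_df = spark.read.csv("hdfs:///data/input_csv/", header=True, inferSchema=True)
parquet_df = spark.read.parquet("hdfs:///data/input_parquet/")

# Both sources must share a schema; align the column order before the union.
merged = csv_df.select(*parquet_df.columns).unionByName(parquet_df)

# One common output format (ORC here) that can back a single Hive table.
merged.write.mode("overwrite").orc("hdfs:///data/merged_orc/")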

How to read the last modified csv files from S3 bucket?

I'd like to know if you have any pro tips for loading, into a Jupyter notebook, only the latest CSV files generated by a Glue job in an S3 bucket.
I use this command to load my CSVs from an S3 folder. Is there an option to select only the most recently modified CSV files?
df = sqlContext.read.csv(
    's3://path',
    header=True, sep=","
)
Previously I tended to convert my dynamic frame into a classic DataFrame so I could overwrite the old files generated by my Glue job; that is not possible with a DynamicFrame.
Thank you
You can use the S3 boto3 API to list the CSV files with their last-modified dates, then sort and filter them and pass the result to the Glue or Spark read API (see the sketch below).
Alternatively, you can use Amazon S3 Inventory and query it with Athena: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
There is a Job Bookmark concept in Glue, but it applies to newly added files, not modified files.
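A minimal boto3 sketch, assuming the sqlContext from the question; the bucket name, prefix, and the "keep the 5 newest" cutoff are hypothetical:
# Hypothetical bucket/prefix: list the CSV objects, sort by LastModified,
# and pass only the newest keys to the Spark CSV reader.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="glue-output/")  # paginate if >1000 keys

csv_objects = [o for o in resp.get("Contents", []) if o["Key"].endswith(".csv")]
csv_objects.sort(key=lambda o: o["LastModified"], reverse=True)

latest_paths = ["s3://my-bucket/" + o["Key"] for o in csv_objects[:5]]

df = sqlContext.read.csv(latest_paths, header=True, sep=",")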

Possible to load one gzip file of multiple CSVs in Redshift

I am trying to load a compressed file which contains multiple CSV files into Redshift. I followed the AWS documentation, Loading Compressed Data Files from Amazon S3. However, I am not sure if I will be able to do the following:
I have multiple CSV files for a table:
table1_part1.csv
table1_part2.csv
table1_part3.csv
I compressed these three files into one table1.csv.gz.
Can I load this gzip file into Redshift table using COPY command?
No, you cannot; but with the COPY command you can give a folder name (containing all the gzipped files) or a wildcard. So don't compress them into one file; independently gzipped files will work fine.
You could also achieve this by creating a manifest file that lists all your CSV files and specifying the manifest file in your COPY command, like:
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
For more details, refer to the Amazon Redshift documentation, section "Using a Manifest to Specify Data Files", which includes an example manifest at the end.
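A minimal sketch of what such a manifest looks like and how you might generate and upload it with boto3; the bucket and key names are hypothetical, and the entry format follows the Redshift COPY manifest documentation:
# Hypothetical bucket/keys: build a manifest listing each independently gzipped
# CSV part, then upload it to S3 so the COPY command above can reference it.
import json
import boto3

parts = ["table1_part1.csv.gz", "table1_part2.csv.gz", "table1_part3.csv.gz"]
manifest = {
    "entries": [{"url": "s3://mybucket/" + key, "mandatory": True} for key in parts]
}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="mybucket",
    Key="cust.manifest",
    Body=json.dumps(manifest).encode("utf-8"),
)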