I am trying to load a compressed file that contains multiple CSV files into Redshift. I followed the AWS documentation Loading Compressed Data Files from Amazon S3. However, I am not sure whether the following will work:
I have multiple CSV files for a table:
table1_part1.csv
table1_part2.csv
table1_part3.csv
I compressed these three files into one table1.csv.gz.
Can I load this gzip file into a Redshift table using the COPY command?
No, you cannot. However, with the COPY command you can give a folder name (containing all the compressed files) or a wildcard. So just don't gzip them into one file; independently gzipped files will work fine.
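For instance, the three parts above can each be gzipped on their own; COPY then picks all of them up via the common key prefix. A minimal sketch (the bucket name and role ARN below are just the placeholders from this thread):

```shell
# Create three sample part files (stand-ins for the real table1 CSVs).
for n in 1 2 3; do
  printf 'id,val\n%s,x\n' "$n" > "table1_part$n.csv"
done
# Compress each part separately instead of bundling them into one archive.
gzip -f table1_part1.csv table1_part2.csv table1_part3.csv
ls table1_part*.csv.gz
# Redshift then loads all three via the common prefix (run in Redshift, not the shell):
# COPY table1 FROM 's3://mybucket/table1_part' GZIP CSV
#   IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
```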
You could achieve this by creating a manifest file that lists all your CSV files, and then specifying the manifest file in your COPY command, like:
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
See the manifest example at the end.
For more details, refer to the Amazon Redshift documentation, section "Using a Manifest to Specify Data Files".
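A manifest is a small JSON file listing the S3 URL of each input file; a sketch for the three parts above (the bucket name is hypothetical):

```json
{
  "entries": [
    {"url": "s3://mybucket/table1_part1.csv", "mandatory": true},
    {"url": "s3://mybucket/table1_part2.csv", "mandatory": true},
    {"url": "s3://mybucket/table1_part3.csv", "mandatory": true}
  ]
}
```

Setting "mandatory" to true makes the COPY fail if that file is missing.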
We are ingesting a zip file from blob storage with Azure Data Factory. Within this zipped file there are multiple csv files which need to be copied to Azure SQL tables.
For example, let's say zipped.zip contains Clients.csv, Accounts.csv, and Users.csv. Each of these CSV files needs to be copied to a different destination table: dbo.Client, dbo.Account, and dbo.User.
It seems that I would need to use a Copy Activity to unzip and copy these files to a staging location, and from there copy those individual files to their respective tables.
Just want to confirm that there is not a different way to unzip and copy all in one action, without using a staging location? Thanks!
Your thoughts are correct: there is no direct way without using a staging step, as long as you are not writing any custom code logic and are leveraging just the Copy activity.
A somewhat similar thread: https://learn.microsoft.com/en-us/answers/questions/989436/unzip-only-csv-files-from-my-zip-file.html
I imported a text file from GCS, did some preparation using Dataprep, and wrote it back to GCS as CSV files. What I want to do is repeat this for all the text files in that bucket. Is there a way to do this for all the files in the bucket (in GCS) at once?
Below is my procedure. I selected a text file from GCS (I can't select more than one text file) and did some preparation (renaming columns, creating new columns, etc.). Then I wrote it back to GCS as CSV.
You can use the Dataset with parameters feature to load several files at once.
You can then use a wildcard to select all the files that you want to load.
Note that all the files need to have the same schema (same columns) for this to work.
See https://cloud.google.com/dataprep/docs/html/Create-Dataset-with-Parameters_118228628 for more information on how to use this feature.
Another solution is to add all the files into a folder* and use the large + button to load all the files in that folder.
[*] Technically, under the same prefix on GCS.
I am trying to load data from Excel files into a table in MySQL. There are 400 Excel files in .xlsx format.
I have successfully ingested one file into the table, but the problem is that this involves manually converting the Excel file into a CSV file, saving it to a location, and then running a query to load it using LOAD DATA LOCAL INFILE. How do I do this for the rest of the files?
How can I load all 400 .xlsx files in a folder without converting them manually to .csv files and running the ingestion query on them one by one? Is there a way in MySQL to do that, for example a FOR loop that goes through all the files and ingests them into the table?
Try bulk-converting your XLSX files into CSVs using in2csv, as found in csvkit.
## single file
in2csv file.xlsx > file.csv
## multiple files
for file in *.xlsx; do in2csv "$file" > "${file%.xlsx}.csv"; done
Then import the data into MySQL using LOAD DATA LOCAL INFILE...
For loading multiple CSVs, use for file in *.csv; do... or see How to Import Multiple csv files into a MySQL Database.
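Assuming the converted CSVs sit together in one folder, that loop can emit one LOAD DATA LOCAL INFILE statement per file and feed the whole batch to the mysql client in a single run. A sketch (the table name and connection details are hypothetical):

```shell
# Two tiny sample CSVs standing in for the 400 converted files.
printf 'a,b\n1,2\n' > t1.csv
printf 'a,b\n3,4\n' > t2.csv
# Emit one LOAD DATA statement per file into a script...
for f in *.csv; do
  printf "LOAD DATA LOCAL INFILE '%s' INTO TABLE mytable FIELDS TERMINATED BY ',' IGNORE 1 LINES;\n" "$f"
done > load_all.sql
cat load_all.sql
# ...then run it in one go (connection details are placeholders):
# mysql --local-infile=1 -u user -p mydb < load_all.sql
```

IGNORE 1 LINES skips each file's header row during the load.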
I have a folder with 620 files. I need to load them all into neo4j with one LOAD command. Is this possible?
You can concatenate all the files in that folder into a single file (e.g., cat * > all_data.csv) and load that single file.
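One caveat: if each CSV carries a header row, a plain cat repeats that header 620 times in the combined file. A sketch that keeps the header only once (the filenames here are made up):

```shell
# Sample part files with identical headers, kept in their own folder.
mkdir -p parts
printf 'id,name\n1,a\n' > parts/f1.csv
printf 'id,name\n2,b\n' > parts/f2.csv
# Take the header once, then only the data rows (line 2 onward) from every file.
head -n 1 parts/f1.csv > all_data.csv
tail -q -n +2 parts/*.csv >> all_data.csv
cat all_data.csv
```

The -q flag stops GNU tail from printing a "==> file <==" banner between files.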
The import tool supports reading from multiple CSV source files, even using a regex pattern to select them. See http://neo4j.com/docs/operations-manual/current/#import-tool
I have a few large zip files on S3. Each of these zip files contains several gz files, which contain data in JSON format. I need to (i) copy the gz files to HDFS and (ii) process the files, preferably with Apache Spark/Impala/Hive. What is the easiest/best way of going about it?
1) Try distcp for copying the files from S3 to HDFS.
2) For processing, use org.apache.spark.sql.hive.HiveContext's read.json to read the JSON data from HDFS and create a DataFrame.
Then perform any operation on it.
Follow this link:
http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes
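The two steps can be sketched locally; the filenames below are made up, and python3 -m zipfile merely stands in for whatever produced the zip on S3. The key point is that once the zip is unpacked, the .gz members need no further decompression: Spark reads gzipped JSON transparently.

```shell
# Mimic the S3 layout locally: a zip wrapping a gzipped JSON-lines file.
printf '{"id": 1}\n{"id": 2}\n' > part1.json
gzip -f part1.json                               # -> part1.json.gz
python3 -m zipfile -c data.zip part1.json.gz
# Step (i): unpack the zip; the .gz members are what distcp would push to HDFS.
python3 -m zipfile -e data.zip extracted/
# Step (ii): Spark reads gzipped JSON directly, e.g.
#   spark.read.json("hdfs:///data/*.json.gz")
zcat extracted/part1.json.gz
```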