So as the question suggests, I am looking for a command similar to Redshift's COPY command that lets me load CSV data stored in an S3 bucket directly into an AWS RDS MySQL table (it's not Aurora).
How do I do that?
Just migrate your database to Aurora, then use LOAD DATA FROM S3 's3://<your-file-location>/<filename>.csv' INTO TABLE <your-table>
Be careful with big CSV files, though: you'll have to manage timeouts and tune your instance for fast write capacity.
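For what it's worth, here is a minimal sketch of a helper that assembles the Aurora MySQL statement, assuming a comma-separated file with quoted fields and one header row. The function name `build_load_from_s3`, the bucket path, and the table name are made up for illustration; only the statement shape (LOAD DATA FROM S3 ... INTO TABLE ... FIELDS ... IGNORE n LINES) comes from Aurora's syntax.

```python
def build_load_from_s3(s3_uri, table, ignore_header_lines=1,
                       field_sep=",", enclosed_by='"'):
    """Return an Aurora MySQL LOAD DATA FROM S3 statement as a string."""

    def quote(value):
        # Escape backslashes and single quotes for a SQL string literal.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    # Keep the illustration simple: only plain alphanumeric/underscore names.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")

    parts = [
        f"LOAD DATA FROM S3 {quote(s3_uri)}",
        f"INTO TABLE `{table}`",
        f"FIELDS TERMINATED BY {quote(field_sep)} ENCLOSED BY {quote(enclosed_by)}",
    ]
    if ignore_header_lines:
        parts.append(f"IGNORE {int(ignore_header_lines)} LINES")
    return "\n".join(parts) + ";"


# Hypothetical bucket and table, for illustration only.
print(build_load_from_s3("s3://my-bucket/data/sales.csv", "sales"))
```

You would run the printed statement from any MySQL client connected to the Aurora cluster; the cluster also needs an IAM role that allows it to read the bucket.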
Related
I have a question about migrating a large amount of data from my local machine to AWS RDS (Aurora DB). Basically, I have a local MySQL database with a couple of tables holding around 4GB of data, and I need to replicate this data in AWS RDS. The approach I was considering was to make INSERT calls to RDS, but with this much data (32 million rows) the process would be costly. I did see some resources on exporting data locally and importing it into RDS, but could not quite understand how that works. Does someone have a good idea about this and can advise me on the best process? PS: the data only exists on my local machine, not on any server.
Dump a CSV extract into S3, then load it with Aurora's load-from-S3 feature; see: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html
I know that AWS Glue can connect to an Amazon RDS instance running the MySQL database engine. However, I want to use Glue to extract data files from S3, transform them, and then load them into a specific MySQL table rather than loading the data into the entire database. Is there a way to do this?
Since you can use (Scala | Python) + Spark for a Glue ETL job, it looks like you can just write the transformed DataFrame straight to the target table over JDBC.
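A rough sketch of what that could look like in Python, under some assumptions: the helper `mysql_jdbc_options`, the host, and the credentials are hypothetical, and the Glue job itself is not shown. The helper only builds the option dict for Spark's JDBC writer; the `dbtable` option is what restricts the write to a single table.

```python
def mysql_jdbc_options(host, database, table, user, password, port=3306):
    """Build the option dict for Spark's JDBC writer, targeting one table."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,      # only this table is written to
        "user": user,
        "password": password,
    }


# Hypothetical endpoint and credentials, for illustration only.
opts = mysql_jdbc_options(
    "mydb.example.us-east-1.rds.amazonaws.com", "shop", "sales", "etl_user", "secret"
)

# Inside the Glue job, after the transformations, with transformed_df being a
# Spark DataFrame (a Glue DynamicFrame can be converted with .toDF()):
#   transformed_df.write.format("jdbc").options(**opts).mode("append").save()
```

Using mode("append") adds rows to the existing table; "overwrite" would replace its contents, so pick deliberately.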
I am looking for options to archive my old data from specific tables of an AWS RDS MySQL database.
I came across AWS S3 and AWS Glacier, and copying the data to either one using pipelines or buckets, but from what I understand these options copy or back up the data to a vault rather than move it.
Is there a proper option to archive the data by moving from RDS to S3 or Glacier or Deep Archive? i.e., deleting from the table in AWS RDS after creating an archive.
What would be the best option for the archival process with my requirements and would it affect the replicas that already exist?
The biggest consideration when "archiving" the data is ensuring that it is in a useful format should you ever want it back again.
Amazon RDS recently added the ability to export RDS snapshot data to Amazon S3.
Thus, the flow could be:
Create a snapshot of the Amazon RDS database
Export the snapshot to Amazon S3 as a Parquet file (you can choose to export specific sets of databases, schemas, or tables)
Set the Storage Class on the exported file as desired (eg Glacier Deep Archive)
Delete the data from the source database (make sure you keep a Snapshot or test the Export before deleting the data!)
When you later wish to access the data:
Restore the data if necessary (based upon Storage Class)
Use Amazon Athena to query the data directly from Amazon S3
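The export step in the flow above could be sketched like this, assuming boto3 is available wherever the job runs. The helper `export_task_params` and all ARNs, bucket names, and table names are hypothetical; the keyword names are the ones accepted by the RDS `start_export_task` call in boto3.

```python
def export_task_params(task_id, snapshot_arn, bucket, role_arn, kms_key_id,
                       tables=None, prefix=""):
    """Build keyword arguments for the RDS start_export_task call."""
    params = {
        "ExportTaskIdentifier": task_id,
        "SourceArn": snapshot_arn,   # ARN of the snapshot to export
        "S3BucketName": bucket,
        "IamRoleArn": role_arn,      # role that lets RDS write to the bucket
        "KmsKeyId": kms_key_id,      # exports are encrypted with a KMS key
    }
    if prefix:
        params["S3Prefix"] = prefix
    if tables:
        # Limit the export to specific databases/tables, e.g. "mydb.orders".
        params["ExportOnly"] = list(tables)
    return params


# Hypothetical identifiers, for illustration only.
params = export_task_params(
    "archive-2020-01",
    "arn:aws:rds:us-east-1:123456789012:snapshot:mydb-2020-01",
    "my-archive-bucket",
    "arn:aws:iam::123456789012:role/rds-s3-export",
    "arn:aws:kms:us-east-1:123456789012:key/example-key-id",
    tables=["mydb.orders"],
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("rds").start_export_task(**params)
```

Only delete the source rows after the export task has finished and you have checked that the Parquet files are queryable.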
I recently built a similar pipeline using AWS Lambda. It runs on a monthly cron schedule (CloudWatch Events), takes a manual snapshot of the RDS instance, exports it to S3, and deletes the records that are older than n days.
I put the util class that I used in a gist; adding it here in case it helps anyone:
JS util class to create and export DB snapshots to S3
PS: I just wanted to add this as a comment on the approved answer, but I don't have enough reputation for that.
Since AWS Aurora does not support the RDS free tier (it does not have Micro instance support), I am using a MySQL server instead.
I have a script that generates data (currently in XML) that can be imported into MySQL, then writes it to an S3 bucket. I was intending to use the LOAD XML FROM S3 command like in this answer to import it from the bucket, but I get a syntax error when I try.
I've looked at AWS Data Pipeline, but it seems hard to maintain since, from what I can tell, it only supports CSV, and I would have to edit the import SQL query manually whenever the structure of the database changes. This is an advantage of XML: LOAD XML gets the column names from the file, not from the query.
Does RDS for MySQL (not Aurora) support importing from S3? Or do I have to generate the XML, write it both locally and to the bucket, then use LOAD XML LOCAL INFILE on the local file?
There are several limitations on importing data into RDS from S3, as described in the official documentation. Check whether any of the following apply to you.
Limitations and Recommendations for Importing Backup Files from Amazon S3 to Amazon RDS
The following are some limitations and recommendations for importing backup files from Amazon S3:
You can only import your data to a new DB instance, not an existing DB instance.
You must use Percona XtraBackup to create the backup of your on-premises database.
You can't migrate from a source database that has tables defined outside of the default MySQL data directory.
You can't import a MySQL 5.5 or 8.0 database.
You can't import an on-premises MySQL 5.6 database to an Amazon RDS MySQL 5.7 or 8.0 database. You can upgrade your DB instance after you complete the import.
You can't restore databases larger than the maximum database size supported by Amazon RDS for MySQL. For more information about storage limits, see General Purpose SSD Storage and Provisioned IOPS SSD Storage.
You can't restore from an encrypted source database, but you can restore to an encrypted Amazon RDS DB instance.
You can't restore from an encrypted backup in the Amazon S3 bucket.
You can't restore from an Amazon S3 bucket in a different AWS Region than your Amazon RDS DB instance.
Importing from Amazon S3 is not supported on the db.t2.micro DB instance class. However, you can restore to a different DB instance class, and then change the instance class later. For more information about instance classes, see Hardware Specifications for All Available DB Instance Classes.
Amazon S3 limits the size of a file uploaded to an Amazon S3 bucket to 5 TB. If a backup file exceeds 5 TB, then you must split the backup file into smaller files.
Amazon RDS limits the number of files uploaded to an Amazon S3 bucket to 1 million. If the backup data for your database, including all full and incremental backups, exceeds 1 million files, use a tarball (.tar.gz) file to store full and incremental backup files in the Amazon S3 bucket.
User accounts are not imported automatically. Save your user accounts from your source database and add them to your new DB instance later.
Functions are not imported automatically. Save your functions from your source database and add them to your new DB instance later.
Stored procedures are not imported automatically. Save your stored procedures from your source database and add them to your new DB instance later.
Time zone information is not imported automatically. Record the time zone information for your source database, and set the time zone of your new DB instance later. For more information, see Local Time Zone for MySQL DB Instances.
Backward migration is not supported for both major versions and minor versions. For example, you can't migrate from version 5.7 to version 5.6, and you can't migrate from version 5.6.39 to version 5.6.37.
I have created a relational database (MySQL) hosted on Amazon Web Services. What I would like to do next is import the data in my local CSV files into this database. I would really appreciate it if someone could give me an outline of how to go about it. Thanks!
The easiest and most hands-off way is the MySQL command line. For large loads, consider spinning up a new EC2 instance, installing the MySQL CL tools, and transferring your file to that machine. Then, after connecting to your database via CL, you'd do something like:
mysql> LOAD DATA LOCAL INFILE 'C:/upload.csv' INTO TABLE myTable;
You can also add options to match your file's format and skip the header row (plenty more in the docs):
mysql> LOAD DATA LOCAL INFILE 'C:/upload.csv' INTO TABLE myTable FIELDS TERMINATED BY ','
ENCLOSED BY '"' IGNORE 1 LINES;
If you're hesitant to use CL, download MySQL Workbench. It connects no prob to AWS RDS.
Closing thoughts:
MySQL LOAD DATA Docs
AWS' Aurora RDS is MySQL-compatible, so the command works there too
"LOCAL" flag actually transfers the file from your client machine (where you're running the command) to the DB server. Without LOCAL, the file must be on the DB server (not possible to transfer it there in advance with RDS)
Works great on huge files too! Just sent an 8.2GB file via this method (260 million rows). Took just over 10 hours from a t2.medium EC2 to db.t2.small Aurora
Not a solution if you need to watch out for unique keys or read the CSV row-by-row and change the data before inserting/updating
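For the last case in the list above (unique keys, or changing rows before they are written), one possible workaround is to read the CSV yourself and generate upserts. This is only a sketch under assumptions: `upsert_batches` is a made-up helper, the `%s` placeholders match what common Python MySQL drivers expect, and the CSV header is assumed to match the table's column names.

```python
import csv


def upsert_batches(csv_file, table, key_column, transform=None, batch_size=1000):
    """Yield (sql, rows) pairs for INSERT ... ON DUPLICATE KEY UPDATE."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    if not columns:
        return  # empty file: nothing to load

    col_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    # On a key collision, overwrite every non-key column with the new value.
    updates = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in columns if c != key_column)
    sql = (f"INSERT INTO `{table}` ({col_list}) VALUES ({placeholders}) "
           f"ON DUPLICATE KEY UPDATE {updates}")

    batch = []
    for row in reader:
        if transform:
            row = transform(row)  # per-row cleanup before writing
        batch.append(tuple(row[c] for c in columns))
        if len(batch) >= batch_size:
            yield sql, batch
            batch = []
    if batch:
        yield sql, batch
```

Each yielded pair can be passed to a driver's `cursor.executemany(sql, rows)`. Expect this to be much slower than LOAD DATA on large files.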
I did some digging and found this official AWS documentation on how to import data from any source into MySQL hosted on RDS.
It is a very detailed step-by-step guide and includes an explanation of how to import CSV files.
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.AnySource.html
Basically, each table must have its own file. Data for multiple tables cannot be combined in the same file. Give each file the same name as the table it corresponds to. The file extension can be anything you like. For example, if the table name is "sales", the file name could be "sales.csv" or "sales.txt", but not "sales_01.csv".
Whenever possible, order the data by the primary key of the table being loaded. This drastically improves load times and minimizes disk storage requirements.
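If your files aren't already ordered, here is a small sketch for sorting a CSV by its primary key column before loading. It assumes the file fits in memory and has a header row; the helper name `sort_csv_by_key` and the column names are made up. For files too big for RAM you'd want an external sort instead.

```python
import csv


def sort_csv_by_key(src, dst, key_column, numeric=True):
    """Copy a CSV from src to dst with rows ordered by key_column."""
    reader = csv.DictReader(src)
    rows = list(reader)  # whole file in memory; fine for modest sizes only

    if numeric:
        rows.sort(key=lambda r: int(r[key_column]))
    else:
        rows.sort(key=lambda r: r[key_column])

    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Example: write a file named after the "sales" table, per the naming rule above.
# with open("unsorted.csv", newline="") as src, open("sales.csv", "w", newline="") as dst:
#     sort_csv_by_key(src, dst, "id")
```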
Another option for importing data into a MySQL database is an external tool such as Alooma, which can do the data import for you in real time.
It depends on how large your file is, but I found that DataGrip imports files under 1GB without any issues: https://www.jetbrains.com/datagrip/
You get a nice mapping tool and a graphical IDE to play around with. DataGrip is available as a free 30-day trial.
I have been experiencing RDS connection dropouts myself with bigger files (over 2GB). I'm not sure whether the problem is on the DataGrip side or the AWS side.
I think your best bet would be to develop a script in your language of choice to connect to the database and import it.
If your database is internet accessible, then you can run that script locally. If it is in a private subnet, then you can run the script either on an EC2 instance with access to the private subnet or on Lambda connected to your VPC. You should really only use Lambda if you expect the runtime to be less than 5 minutes or so.
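As a starting point, such a script might look roughly like the sketch below. It assumes `conn` is a DB-API 2.0 connection from whatever MySQL driver you choose (those typically use `%s` placeholders), and that the CSV header matches the table's column names. `load_csv` is a made-up name, and the table and column names are not escaped here, so only use it with trusted files.

```python
import csv


def load_csv(conn, csv_file, table, placeholder="%s", batch_size=5000):
    """Insert CSV rows into `table` in batches; return the number of rows loaded."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first line holds the column names
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(header), ", ".join([placeholder] * len(header))
    )

    cur = conn.cursor()
    total, batch = 0, []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            cur.executemany(sql, batch)
            conn.commit()  # commit per batch to keep transactions small
            total += len(batch)
            batch = []
    if batch:
        cur.executemany(sql, batch)
        conn.commit()
        total += len(batch)
    return total
```

Committing per batch means a dropped connection only loses the current batch, which helps if the script runs under a time limit.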
Edit: Note that Lambda only supports a handful of languages:
AWS Lambda supports code written in Node.js (JavaScript), Python, Java (Java 8 compatible), and C# (.NET Core).