AWS MySQL to GCP BigQuery data migration - mysql

I'm planning a Data Migration from AWS MySQL instances to GCP BigQuery. I don't want to migrate every MySQL Database because finally I want to create a Data Warehouse using BigQuery.
Would exporting AWS MySQL DB to S3 buckets as csv/json/avro, then transfer to GCP buckets be a good option? What would be the best practices for this Data pipeline?

If this was a MySQL to MySQL migration; there were other possible options. But in this case the option you mentioned is perfect.. Also, remember that your MySQL database will keep getting updated.. So, your destination DB might have some records missed out.. because it is not real-time DB transfer.

Your proposal of exporting to S3 files should work OK, and to export the files you can take advantage of the AWS Database Migration Service
With that service you can do either a once-off export to S3, or an incremental export with Change Data Capture. Unfortunately, since BigQuery is not really designed for working with changes on its tables, implementing CDC can be a bit cumbersome (although totally doable). You need to take into account the cost of transferring data across providers.
Another option, which would be much easier for you, is to use the same AWS Database Migration service to move data directly to Amazon Redshift.
In this case, you would get change data capture automatically, so you don't need to worry about anything. And RedShift is an excellent tool to build your data warehouse.
If you don't want to use RedShift for any reason, and you prefer a fully serverless solution, then you can easily use AWS Glue Catalog to read from your databases and export to AWS Athena.
The cool thing about the AWS based solutions is everything is tightly integrated, you can use the same account/users for billing, IAM, monitoring... and since you are moving data within a single provider, there is no extra charge for networking, no latency, and potentially fewer security issues.

Related

Mirroring homogeneous data from one MySQL RDS to another MySQL RDS

I have two MySQL RDS's (hosted on AWS). One of these RDS instances is my "production" RDS, and the other is my "performance" RDS. These RDS's have the same schema and tables.
Once a year, we take a snapshot of the production RDS, and load it into the performance RDS, so that our performance environment will have similar data to production. This process takes a while - there's data specific to the performance environment that must be re-added each time we do this mirror.
I'm trying to find a way to automate this process, and to achieve the following:
Do a one time mirror in which all data is copied over from our production database to our performance database.
Continuously (preferably weekly) mirror all new data (but not old data) between our production and performance MySQL RDS's.
During the continuous mirroring, I'd like for the production data not to overwrite anything already in the performance database. I'd only want new data to be inserted into the production database.
During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails).
The following are the tools I've been researching to assist me with this process:
AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
Amazon Kinesis Data Streams also seems able to handle my use case - I could write a "fetcher" program that gets all new data from the prod MySQL binlog, sends it to Kinesis Data Streams, then write a Lambda that transforms the data (and decides on what data to send/add/obfuscate) and sends it to my destination (being the performance RDS, or if I can't directly do that, then a consumer HTTP endpoint I write that updates the performance RDS).
I'm not sure which of these tools to use - DMS seems to be built for migrating heterogeneous data and not homogeneous data, so I'm not sure if I should use it. Similarly, it seems like I could create something that works with Kinesis Data Streams, but the fact that I'll have to make a custom program that fetches data from MySQL's binlog and another program that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.
Which of these tools is best capable of handling my use case? Or is there another tool that I should be using for this instead?

AWS - How to find all AWS products that interact with a MySQL DB?

Recently I was given a responsibility on some product running at AWS. This product includes several web-services running on EC2 instances and several databases (MySQL, DynamoDB etc').
Since there is barely any documentation, I'm still trying to wrap my head around the architecture of the data flow. Do I have a way, given a known MySQL DB - for example, to map out its' interactions with other AWS resources? For instance: some EC2 tries to write / read data from the database; A lambda is pushing data from that database to S3, and so on.
I've tried looking at cloudtrail logs and found them pretty insufficient. Is there another way?
I would prefer if this could be done through awscli obviously, but a GUI solution could also work if it exists.

What's the most efficient way to transfer data from one AWS RDS instance to another

I am working for a client who uses multiple RDS (MySQL) instances on AWS and wants me to consolidate data from there and other sources into a single instance and do reporting off that.
What would be the most efficient way to transfer selective data from other AWS RDS MySQL instances to mine?
I don't want to migrate the entire DB, rather just a few columns and rows based on which have relevant data and what was last created/updated.
One option would be to use a PHP script that'd read from one DB and insert it into another, but it'd be very inefficient. Unlike SQL Server or ORACLE, MySQL also does not have the ability to write queries across servers, else I'd have just used that in a stored procedure.
I'd appreciate any inputs regarding this.
If your overall objective is reporting and analytics, the standard practice is to move your transactional data from RDS to Redshift which will become your data warehouse. This blog article by AWS provides an approach to do it.
For the consolidation operation, you can use AWS Data Migration Service which will allow you to migrate data column wise with following options.
Migrate existing data
Migrate existing data & replicate ongoing changes
Replicate data changes only
For more details read this whitepaper.
Note: If you need to process the data while moving, use AWS Data Pipeline.
Did you take a look at the RDS migration tool?

Amazon RDS database across application environments

I have three different application environments: production, demo, and dev. In each, I have an RDS instance running MySQL. I have five tables that house data that needs to be the same across all environments. I am trying to find a way to handle this.
For security purposes, it's not best to allow demo and dev to access the production database, so putting the data there seems to be a bad idea.
All environments need read/write capabilities. Is there a good solution to this?
Many thanks.
For security purposes, it's not best to allow demo and dev to access the production database, so putting the data there seems to be a bad idea.
Agreed. Do not have your demo/dev environments access data from your production environments.
I don't know your business logic, but I cannot think of a case where dev/demo data needs to be "in sync" with production data, unless the dev/demo environment is also dependent on other "production assets". If that were the case, I would suggest duplicating that data into your other environments.
Usually, the data in your database would be dependent on the environment it's contained within.
For best security and separation of concerns, keep your environment segregated as much as possible. This includes (but not limited to):
database data,
customer data,
images and other files
If data needs to be synchronized, create a script/program to perform that synchronization completely (db + all necessary assets). But do that as part of your normal development pipeline so it goes through dev+testing+qa etc.
So the thing about RDS and database level access is that you still would manage the user credentials like you would on premise. From an AWS perspective all you would need to do to allow access is update the security groups of your Mysql RDS instances to allow the traffic, then give your application the credentials you have provisioned for it. I do agree it is bad practice to give production level access to your dev or demo environments.
As far as the data being the same you can automate a nightly snapshot of the Production database and recreate new instances based on that. If your infrastructure is in Cloudformation or Terraform you can provide the new endpoint created in the snapshot and spin up a new DEV or DEMO environment.
Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases. You can create a DB instance by restoring from this DB snapshot. When you restore the DB instance, you provide the name of the DB snapshot to restore from, and then provide a name for the new DB instance that is created from the restore. You cannot restore from a DB snapshot to an existing DB instance; a new DB instance is created when you restore.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateSnapshot.html
I would recommend using a fan out system at the point of data capture, along with a snapshot.
Take a point in time snap shot (i.e. now), spin up test/dev databases from this, and then use SQS->SNS->SQS fan out architecture to push any new changes to the data to your other databases?

Right database for machine learning on 100 TB of data

I need to perform classification and clustering on about 100tb of web data and I was planning on using Hadoop and Mahout and AWS. What database do you recommend I use to store the data? Will MySQL work or would something like MongoDB be significantly faster? Are there other advantages of one database or the other? Thanks.
The simplest and most direct answer would be to just put the files directly in HDFS or S3 (since you mentioned AWS) and point Hadoop/Mahout directly at them. Other databases have different purposes, but Hadoop/HDFS is designed for exactly this kind of high-volume, batch-style analytics. If you want a more database-style access layer, then you can add Hive without too much trouble. The underlying storage layer would still be HDFS or S3, but Hive can give you SQL-like access to the data stored there, if that's what you're after.
Just to address the two other options you brought up: MongoDB is good for low-latency reads and writes, but you probably don't need that. And I'm not up on all the advanced features of MySQL, but I'm guessing 100TB is going to be pretty tough for it to deal with, especially when you start getting into large queries that access all of the data. It's more designed for traditional, transactional access.