I have two MySQL RDS instances (hosted on AWS). One of these instances is my "production" RDS, and the other is my "performance" RDS. Both instances have the same schema and tables.
Once a year, we take a snapshot of the production RDS, and load it into the performance RDS, so that our performance environment will have similar data to production. This process takes a while - there's data specific to the performance environment that must be re-added each time we do this mirror.
I'm trying to find a way to automate this process, and to achieve the following:
Do a one-time mirror in which all data is copied from our production database to our performance database.
Continuously (preferably weekly) mirror all new data (but not old data) from our production MySQL RDS to our performance MySQL RDS.
During the continuous mirroring, I'd like for the production data not to overwrite anything already in the performance database. I'd only want new data to be inserted into the performance database.
During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails).
The following are the tools I've been researching to assist me with this process:
AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
Amazon Kinesis Data Streams also seems able to handle my use case - I could write a "fetcher" program that reads all new data from the production MySQL binlog and sends it to Kinesis Data Streams, then write a Lambda that transforms the data (and decides what to send/add/obfuscate) and writes it to my destination (the performance RDS, or, if I can't write to it directly, a consumer HTTP endpoint I write that updates the performance RDS). A rough sketch of that Lambda step is at the end of this question.
I'm not sure which of these tools to use - DMS seems to be built for migrating heterogeneous data and not homogeneous data, so I'm not sure if I should use it. Similarly, it seems like I could create something that works with Kinesis Data Streams, but the fact that I'll have to make a custom program that fetches data from MySQL's binlog and another program that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.
Which of these tools is best capable of handling my use case? Or is there another tool that I should be using for this instead?
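For context, here is a rough sketch of what I imagine the Lambda transform step could look like in Go. The rowEvent shape, the users table layout, and the PERF_DB_DSN environment variable are just assumptions for illustration, not anything Kinesis or DMS defines; INSERT IGNORE is there so existing performance data is never overwritten.

package main

import (
    "context"
    "crypto/sha256"
    "database/sql"
    "encoding/hex"
    "encoding/json"
    "os"

    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    _ "github.com/go-sql-driver/mysql"
)

// rowEvent is the hypothetical JSON payload the binlog "fetcher" would publish.
type rowEvent struct {
    Table string            `json:"table"`
    PK    string            `json:"pk"`
    Cols  map[string]string `json:"cols"`
}

// obfuscateEmail replaces a real address with a stable, anonymised one.
func obfuscateEmail(email string) string {
    sum := sha256.Sum256([]byte(email))
    return hex.EncodeToString(sum[:8]) + "@example.invalid"
}

func handler(ctx context.Context, evt events.KinesisEvent) error {
    // PERF_DB_DSN is an assumed env var, e.g. "user:pass@tcp(perf-host:3306)/app".
    db, err := sql.Open("mysql", os.Getenv("PERF_DB_DSN"))
    if err != nil {
        return err
    }
    defer db.Close()

    for _, rec := range evt.Records {
        var ev rowEvent
        if err := json.Unmarshal(rec.Kinesis.Data, &ev); err != nil {
            return err
        }
        if email, ok := ev.Cols["email"]; ok {
            ev.Cols["email"] = obfuscateEmail(email)
        }
        if ev.Table != "users" { // only the users table is handled in this sketch
            continue
        }
        // INSERT IGNORE adds new rows only and never overwrites existing
        // performance data.
        if _, err := db.ExecContext(ctx,
            "INSERT IGNORE INTO users (id, name, email) VALUES (?, ?, ?)",
            ev.PK, ev.Cols["name"], ev.Cols["email"]); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    lambda.Start(handler)
}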
Related
I'm planning a data migration from AWS MySQL instances to GCP BigQuery. I don't want to migrate every MySQL database, because ultimately I want to create a data warehouse using BigQuery.
Would exporting the AWS MySQL DBs to S3 buckets as CSV/JSON/Avro and then transferring them to GCP buckets be a good option? What would be the best practices for this data pipeline?
If this were a MySQL-to-MySQL migration, there would be other possible options, but in this case the option you mentioned is perfect. Also, remember that your MySQL database will keep getting updated, so your destination DB might miss some records, because this is not a real-time transfer.
Your proposal of exporting files to S3 should work OK, and to export the files you can take advantage of the AWS Database Migration Service.
With that service you can do either a once-off export to S3 or an incremental export with Change Data Capture (CDC). Unfortunately, since BigQuery is not really designed for working with changes to its tables, implementing CDC can be a bit cumbersome (although totally doable). You also need to take into account the cost of transferring data across providers.
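If you go the DMS route, a minimal sketch of creating the replication task with the AWS SDK for Go might look like the following. The ARNs are placeholders, the table-mapping JSON is reduced to a single include-everything selection rule, and the choice between a once-off full load and an incremental export is just the MigrationType value.

package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/databasemigrationservice"
)

// Minimal table mapping: replicate every table in the "mydb" schema.
const tableMappings = `{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-all",
      "object-locator": { "schema-name": "mydb", "table-name": "%" },
      "rule-action": "include"
    }
  ]
}`

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    dms := databasemigrationservice.NewFromConfig(cfg)

    // Placeholder ARNs: the source endpoint is MySQL, the target endpoint is
    // an S3 endpoint you have created beforehand.
    _, err = dms.CreateReplicationTask(ctx, &databasemigrationservice.CreateReplicationTaskInput{
        ReplicationTaskIdentifier: aws.String("mysql-to-s3-export"),
        ReplicationInstanceArn:    aws.String("arn:aws:dms:...:rep:EXAMPLE"),
        SourceEndpointArn:         aws.String("arn:aws:dms:...:endpoint:SOURCE"),
        TargetEndpointArn:         aws.String("arn:aws:dms:...:endpoint:TARGET"),
        MigrationType:             "full-load-and-cdc", // or "full-load" for a once-off export
        TableMappings:             aws.String(tableMappings),
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("replication task created")
}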
Another option, which would be much easier for you, is to use the same AWS Database Migration Service to move data directly to Amazon Redshift.
In this case, you would get change data capture automatically, so you don't need to worry about anything. And Redshift is an excellent tool for building your data warehouse.
If you don't want to use Redshift for any reason, and you prefer a fully serverless solution, you can use the AWS Glue Data Catalog to catalog your exported data and query it with Amazon Athena.
The cool thing about the AWS-based solutions is that everything is tightly integrated: you can use the same account/users for billing, IAM, and monitoring, and since you are moving data within a single provider there are no cross-provider networking charges, lower latency, and potentially fewer security issues.
I am working for a client who uses multiple RDS (MySQL) instances on AWS and wants me to consolidate data from there and other sources into a single instance and do reporting off that.
What would be the most efficient way to transfer selective data from other AWS RDS MySQL instances to mine?
I don't want to migrate the entire DB, rather just a few columns and rows based on which have relevant data and what was last created/updated.
One option would be to use a PHP script that reads from one DB and inserts into another, but that would be very inefficient. Unlike SQL Server or Oracle, MySQL also does not have the ability to run queries across servers, otherwise I'd have just used that in a stored procedure.
I'd appreciate any inputs regarding this.
If your overall objective is reporting and analytics, the standard practice is to move your transactional data from RDS to Redshift, which will become your data warehouse. This AWS blog article provides an approach to doing it.
For the consolidation operation, you can use AWS Database Migration Service, which allows you to migrate data column-wise with the following options (a table-mapping sketch is shown after this answer):
Migrate existing data
Migrate existing data & replicate ongoing changes
Replicate data changes only
For more details read this whitepaper.
Note: If you need to process the data while moving, use AWS Data Pipeline.
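For the "just a few columns" requirement, DMS selection and transformation rules can include specific tables and drop columns you don't need. Below is a hedged sketch of such a table mapping, embedded in a small Go program that only validates the JSON; the schema, table, and column names are made up, and the JSON would be supplied as the TableMappings parameter of a DMS replication task (console, CLI, or SDK).

package main

import (
    "encoding/json"
    "fmt"
    "log"
)

// Hypothetical DMS table mapping: copy only the "orders" table from the
// "sales" schema and drop the "internal_notes" column on the way.
const tableMappings = `{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "orders-only",
      "object-locator": { "schema-name": "sales", "table-name": "orders" },
      "rule-action": "include"
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-name": "drop-notes",
      "rule-target": "column",
      "object-locator": {
        "schema-name": "sales",
        "table-name": "orders",
        "column-name": "internal_notes"
      },
      "rule-action": "remove-column"
    }
  ]
}`

func main() {
    // Sanity-check the JSON before passing it to DMS as the TableMappings parameter.
    var m map[string]interface{}
    if err := json.Unmarshal([]byte(tableMappings), &m); err != nil {
        log.Fatalf("invalid table mapping: %v", err)
    }
    fmt.Println("table mapping OK")
}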
Did you take a look at the RDS migration tool?
I have three different application environments: production, demo, and dev. In each, I have an RDS instance running MySQL. I have five tables that house data that needs to be the same across all environments. I am trying to find a way to handle this.
For security purposes, it's not best to allow demo and dev to access the production database, so putting the data there seems to be a bad idea.
All environments need read/write capabilities. Is there a good solution to this?
Many thanks.
For security purposes, it's not best to allow demo and dev to access the production database, so putting the data there seems to be a bad idea.
Agreed. Do not have your demo/dev environments access data from your production environments.
I don't know your business logic, but I cannot think of a case where dev/demo data needs to be "in sync" with production data, unless the dev/demo environment is also dependent on other "production assets". If that were the case, I would suggest duplicating that data into your other environments.
Usually, the data in your database would be dependent on the environment it's contained within.
For best security and separation of concerns, keep your environments segregated as much as possible. This includes (but is not limited to):
database data,
customer data,
images and other files
If data needs to be synchronized, create a script/program to perform that synchronization completely (db + all necessary assets). But do that as part of your normal development pipeline so it goes through dev+testing+qa etc.
So the thing about RDS and database-level access is that you still manage the user credentials as you would on premises. From an AWS perspective, all you would need to do to allow access is update the security groups of your MySQL RDS instances to allow the traffic, then give your application the credentials you have provisioned for it. I do agree it is bad practice to give production-level access to your dev or demo environments.
As far as keeping the data the same, you can automate a nightly snapshot of the production database and recreate new instances based on it. If your infrastructure is in CloudFormation or Terraform, you can feed in the endpoint of the instance restored from the snapshot and spin up a new dev or demo environment.
Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases. You can create a DB instance by restoring from this DB snapshot. When you restore the DB instance, you provide the name of the DB snapshot to restore from, and then provide a name for the new DB instance that is created from the restore. You cannot restore from a DB snapshot to an existing DB instance; a new DB instance is created when you restore.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateSnapshot.html
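A hedged sketch of automating the snapshot-and-restore with the AWS SDK for Go; the instance and snapshot identifiers are placeholders, and the new endpoint would then be fed to your CloudFormation/Terraform stack.

package main

import (
    "context"
    "log"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/rds"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := rds.NewFromConfig(cfg)

    snapshotID := "prod-nightly-" + time.Now().Format("2006-01-02")

    // 1. Snapshot the production instance (placeholder identifier).
    if _, err := client.CreateDBSnapshot(ctx, &rds.CreateDBSnapshotInput{
        DBInstanceIdentifier: aws.String("prod-mysql"),
        DBSnapshotIdentifier: aws.String(snapshotID),
    }); err != nil {
        log.Fatal(err)
    }

    // 2. Poll until the snapshot is "available" before restoring from it.
    for {
        out, err := client.DescribeDBSnapshots(ctx, &rds.DescribeDBSnapshotsInput{
            DBSnapshotIdentifier: aws.String(snapshotID),
        })
        if err != nil {
            log.Fatal(err)
        }
        if len(out.DBSnapshots) > 0 && aws.ToString(out.DBSnapshots[0].Status) == "available" {
            break
        }
        time.Sleep(30 * time.Second)
    }

    // 3. Restore the snapshot as a brand-new dev/demo instance.
    if _, err := client.RestoreDBInstanceFromDBSnapshot(ctx, &rds.RestoreDBInstanceFromDBSnapshotInput{
        DBInstanceIdentifier: aws.String("dev-mysql-" + time.Now().Format("20060102")),
        DBSnapshotIdentifier: aws.String(snapshotID),
    }); err != nil {
        log.Fatal(err)
    }

    log.Println("restore started; feed the new endpoint to CloudFormation/Terraform")
}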
I would recommend using a fan-out system at the point of data capture, along with a snapshot.
Take a point-in-time snapshot (i.e. now), spin up test/dev databases from it, and then use an SQS->SNS->SQS fan-out architecture to push any new changes to the data out to your other databases.
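As a rough illustration of the fan-out idea, publishing each captured change to an SNS topic that every environment's SQS queue subscribes to could look like the sketch below; the topic ARN and message shape are placeholders.

package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/sns"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := sns.NewFromConfig(cfg)

    // One SNS topic per shared table; each environment's SQS queue subscribes
    // to it, so a single write is fanned out to prod, demo, and dev consumers.
    _, err = client.Publish(ctx, &sns.PublishInput{
        TopicArn: aws.String("arn:aws:sns:us-east-1:123456789012:shared-table-changes"),
        Message:  aws.String(`{"table":"countries","op":"insert","id":42}`),
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("change published to all subscribed environments")
}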
Currently we have several microservices; each has its own database model and migrations provided by the GORM Golang package. We also have a big old MySQL database, which goes against microservice principles, but we can't replace it. I'm afraid that as the number of microservices grows, we will get lost in the many database models. When I add a new column in a microservice, I just type service migrate in the terminal (there is a CLI for run and migrate commands), and it refreshes the database.
What is the best practice for managing this? For example, if I have 1000 microservices, no one will type service migrate every time someone refreshes the models. I'm thinking about a centralized database service where we just add a new column and it stores all the models with all the migrations. The only problem is: how will the services get to know about database model changes? This is how we store, for example, a user in a service:
package models

import "database/sql" // needed for sql.NullString

// User is the GORM model backing the "users" table.
type User struct {
    ID       uint           `gorm:"column:id;not null" sql:"AUTO_INCREMENT"`
    Name     string         `gorm:"column:name;not null" sql:"type:varchar(100)"`
    Username sql.NullString `gorm:"column:username;not null" sql:"type:varchar(255)"`
}

// TableName tells GORM which table this model maps to.
func (u *User) TableName() string {
    return "users"
}
Depending on your use cases, MySQL Cluster might be an option. The two-phase commits used by MySQL Cluster make frequent writes impractical, but if write performance isn't a big issue then I would expect MySQL Cluster to work out better than connection pooling or queuing hacks. Certainly worth considering.
If I'm understanding your question correctly, you're trying to still use one MySQL instance but with many microservices.
There are a few ways to make an SQL system work:
You could create a microservice type that handles data inserts/reads from the database and takes advantage of connection pooling, and have the rest of your services do all their data reads/writes through these services. This will definitely add a bit of extra latency to all your reads/writes and will likely be problematic at scale.
You could attempt to look for a multi-master SQL solution (e.g. CitusDB) that scales easily. You can use a central schema for your database and just make sure to handle edge cases for data insertion (de-duping etc.).
You can use data-streaming architectures like Kafka or AWS Kinesis to transfer your data to your microservices and make sure they only deal with data through these streams. This way, you can de-couple your database from your data.
The best way to approach it, in my opinion, is #3. This way, you won't have to think about your storage at the computation layer of your microservice architecture. (A minimal producer sketch is included at the end of this answer.)
Not sure what service you're using for your microservices, but StdLib enforces a few conventions (e.g. only transferring data through HTTP) that help folks wrap their heads around it all. AWS Lambda also works very well with Kinesis as a source to launch the function, which could help with the #3 approach.
Disclaimer: I'm the founder of StdLib.
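If you go with #3, the producer side can stay very small. A minimal sketch of pushing a change event into a Kinesis stream with the AWS SDK for Go follows; the stream name and payload shape are made up for illustration.

package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/kinesis"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := kinesis.NewFromConfig(cfg)

    // Each microservice publishes its writes as events; consumers own their
    // local state and never touch the shared MySQL database directly.
    _, err = client.PutRecord(ctx, &kinesis.PutRecordInput{
        StreamName:   aws.String("user-events"), // assumed stream name
        PartitionKey: aws.String("user-42"),     // keeps one user's events ordered
        Data:         []byte(`{"type":"user_created","id":42}`),
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("event published")
}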
If I understand your question correctly, it seems to me that there may be multiple ways to achieve this.
One solution is to keep a schema version somewhere in the database that your microservices periodically check. When your database schema changes, you increase the schema version. Then, if a service notices that the database schema version is higher than the service's own schema version, it can run its migration in code, which GORM allows.
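A minimal sketch of that idea, assuming a hypothetical single-row schema_version table and the current gorm.io/gorm API (adapt to whichever GORM version you use):

package main

import (
    "log"

    "gorm.io/driver/mysql"
    "gorm.io/gorm"
)

// User mirrors the struct from the question (simplified).
type User struct {
    ID   uint   `gorm:"column:id;not null"`
    Name string `gorm:"column:name;not null;type:varchar(100)"`
}

func (User) TableName() string { return "users" }

// currentVersion is the schema version this build of the service understands.
const currentVersion = 3

func main() {
    dsn := "user:pass@tcp(db:3306)/app?parseTime=true" // placeholder DSN
    db, err := gorm.Open(mysql.Open(dsn), &gorm.Config{})
    if err != nil {
        log.Fatal(err)
    }

    // Hypothetical table: schema_version(version INT), bumped centrally.
    var dbVersion int
    if err := db.Raw("SELECT version FROM schema_version LIMIT 1").Scan(&dbVersion).Error; err != nil {
        log.Fatal(err)
    }

    // Only run the migration when the database declares a newer schema version
    // than the one this service was built against.
    if dbVersion > currentVersion {
        if err := db.AutoMigrate(&User{}); err != nil {
            log.Fatal(err)
        }
        log.Printf("ran migrations; db declares schema v%d, service was at v%d", dbVersion, currentVersion)
    }
}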
Other options depend on how you run your microservices. For example, if you run them on an orchestration platform (e.g. Kubernetes), you could put the migration code somewhere that runs when your service initializes. Then, once you update the schema, you can force a rolling refresh of your containers, which would in turn trigger the migration.
I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables another team uses/works on, not me. These backups need to happen once a week and will only be used to restore the DynamoDB tables in disastrous situations (so hopefully never).
I saw that there is a way to do this by setting up a Data Pipeline, which I'm guessing you can schedule to do the job once a week. However, it seems like this would keep the pipeline open and start incurring charges. So I was wondering whether there is a significant cost difference between backing the tables up via the pipeline and keeping it open, versus creating something like a PowerShell script scheduled to run on an EC2 instance that already exists, which would manually create a JSON mapping file and upload it to S3.
Also, I guess the other question is more of a practicality question: how difficult is it to back up DynamoDB tables to JSON format? It doesn't seem too hard, but I wasn't sure. Sorry if these questions are too general.
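For reference, here is a rough sketch of the "script on an existing instance" idea, shown in Go rather than PowerShell; the table and bucket names are assumptions. It scans the whole table, converts the items to plain JSON, and uploads one dated file per run.

package main

import (
    "bytes"
    "context"
    "encoding/json"
    "log"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    ddb := dynamodb.NewFromConfig(cfg)
    s3c := s3.NewFromConfig(cfg)

    // Scan the whole table, following pagination until LastEvaluatedKey is empty.
    var items []map[string]types.AttributeValue
    var startKey map[string]types.AttributeValue
    for {
        out, err := ddb.Scan(ctx, &dynamodb.ScanInput{
            TableName:         aws.String("my-table"), // assumed table name
            ExclusiveStartKey: startKey,
        })
        if err != nil {
            log.Fatal(err)
        }
        items = append(items, out.Items...)
        if len(out.LastEvaluatedKey) == 0 {
            break
        }
        startKey = out.LastEvaluatedKey
    }

    // Convert DynamoDB attribute values into plain maps and serialize as JSON.
    var plain []map[string]interface{}
    if err := attributevalue.UnmarshalListOfMaps(items, &plain); err != nil {
        log.Fatal(err)
    }
    body, err := json.Marshal(plain)
    if err != nil {
        log.Fatal(err)
    }

    // Upload one dated JSON file per weekly run (assumed bucket name).
    key := "dynamodb-backups/my-table-" + time.Now().Format("2006-01-02") + ".json"
    if _, err := s3c.PutObject(ctx, &s3.PutObjectInput{
        Bucket: aws.String("my-backup-bucket"),
        Key:    aws.String(key),
        Body:   bytes.NewReader(body),
    }); err != nil {
        log.Fatal(err)
    }
    log.Printf("backed up %d items to s3://my-backup-bucket/%s", len(items), key)
}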
Are you working under the assumption that Data Pipeline keeps the server up forever? That is not the case.
For instance, if you have defined a Shell Activity, the server will terminate after the activity completes. (You may manually set the termination protection. Ref.)
Since you only run a pipeline once a week, the costs are not high.
If you run a cron job on an EC2 instance, that instance needs to be up when you want to run the backup - and that could be a point of failure.
Incidentally, Amazon provides a Data Pipeline sample on how to export data from DynamoDB.
I just checked the pipeline cost page, and it says "For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month". So I think I'm safe.