AWS RDS showing outdated data for Multi-AZ instance (MySQL)

My RDS instance was showing outdated data temporarily.
I ran a SELECT query on my data. I then ran a query to delete data from a table and another to add new data to the table. I ran a SELECT query and it was showing the old data.
I ran the SELECT query AGAIN and THEN it finally showed me the new data.
Why would this happen? I never had these issues locally or on my normal single-AZ instances. Is there a way to avoid this happening?
I am running MySQL 5.6.23.

According to the Amazon RDS Multi-AZ FAQs, this might be expected.
Specifically this:
You may observe elevated latencies relative to a standard DB Instance deployment in a single Availability Zone as a result of the synchronous data replication performed on your behalf.
Of course, it depends on how frequently you observe the delays and how large the added latency is, but one option would be to contact AWS support if the issue is frequently reproducible.

As embarrassing as this is... it was an issue in our Spring Java code and not AWS.
A method modified a database entity object. The method itself wasn't transactional, but it was called from a transactional context, which persists any changes made to managed entities to the database.
It looked like changes were being rolled back, but what was actually happening was that the method was overwriting data. My guess is it had been overwriting the data for a while, so until someone tried to modify it we just assumed it was the correct data.
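To illustrate the kind of pattern that bit us (the class, entity, and repository names below are hypothetical, not our actual code), a helper that isn't transactional itself can still have its changes persisted, because it mutates a managed entity inside a @Transactional caller and everything is flushed on commit:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical sketch; User and UserRepository are assumed to be a JPA
// entity and a Spring Data repository defined elsewhere.
@Service
public class ProfileService {

    @Autowired
    private UserRepository userRepository;

    @Transactional
    public void renameUser(long userId, String newName) {
        User user = userRepository.findOne(userId); // managed JPA entity
        applyDefaults(user);  // not transactional, but runs inside this transaction
        user.setName(newName);
        // No explicit save() is needed: when the transaction commits, the
        // persistence context flushes ALL changes made to the managed entity,
        // including the ones applyDefaults() made, silently overwriting
        // whatever was in the database.
    }

    // Looks like a harmless helper, but any field it sets on a managed
    // entity will be persisted by the surrounding transaction.
    private void applyDefaults(User user) {
        user.setEmail("default@example.com");
    }
}
```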

Related

Mirroring homogeneous data from one MySQL RDS to another MySQL RDS

I have two MySQL RDS instances (hosted on AWS). One of these instances is my "production" RDS, and the other is my "performance" RDS. Both instances have the same schema and tables.
Once a year, we take a snapshot of the production RDS, and load it into the performance RDS, so that our performance environment will have similar data to production. This process takes a while - there's data specific to the performance environment that must be re-added each time we do this mirror.
I'm trying to find a way to automate this process, and to achieve the following:
Do a one time mirror in which all data is copied over from our production database to our performance database.
Continuously (preferably weekly) mirror all new data (but not old data) between our production and performance MySQL RDS's.
During the continuous mirroring, I'd like for the production data not to overwrite anything already in the performance database. I'd only want new data to be inserted into the performance database.
During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails).
The following are the tools I've been researching to assist me with this process:
AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
Amazon Kinesis Data Streams also seems able to handle my use case: I could write a "fetcher" program that reads new data from the production MySQL binlog and sends it to Kinesis Data Streams, then write a Lambda that transforms the data (deciding what to send/add/obfuscate) and sends it to my destination (the performance RDS, or, if I can't write to it directly, a consumer HTTP endpoint I write that updates the performance RDS).
I'm not sure which of these tools to use - DMS seems to be built for migrating heterogeneous data and not homogeneous data, so I'm not sure if I should use it. Similarly, it seems like I could create something that works with Kinesis Data Streams, but the fact that I'll have to make a custom program that fetches data from MySQL's binlog and another program that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.
Which of these tools is best capable of handling my use case? Or is there another tool that I should be using for this instead?
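For what it's worth, here is a minimal sketch of the transform step described in the Kinesis option above (the table, columns, JDBC endpoint, and credentials are all hypothetical): the email is obfuscated deterministically, and INSERT IGNORE ensures rows already present in the performance database are never overwritten.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical transform step for the Kinesis/Lambda option; the table name,
// columns, JDBC URL, and credentials are assumptions, not the real schema.
public class PerformanceMirrorWriter {

    private static final String JDBC_URL =
            "jdbc:mysql://performance-rds.example.com:3306/app"; // hypothetical endpoint

    // Obfuscate an email deterministically so the same address always maps to
    // the same fake value (useful if other tables reference it).
    static String obfuscateEmail(String email) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(email.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            hex.append(String.format("%02x", digest[i]));
        }
        return "user-" + hex + "@example.invalid";
    }

    // Insert a single replicated row. INSERT IGNORE leaves rows that already
    // exist in the performance database untouched, which matches the
    // "never overwrite existing performance data" requirement.
    static void writeUser(long id, String name, String email) throws Exception {
        try (Connection conn = DriverManager.getConnection(JDBC_URL, "mirror", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT IGNORE INTO users (id, name, email) VALUES (?, ?, ?)")) {
            ps.setLong(1, id);
            ps.setString(2, name);
            ps.setString(3, obfuscateEmail(email));
            ps.executeUpdate();
        }
    }
}
```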

ClearDB - does it take time to read the newest data?

I just used MySQL Workbench to connect to my ClearDB account, which is connected to an Azure web app. The problem is that even though I ran a query that drops/creates tables in the newly made schema, mirroring exactly the tables and data of my previous live server, when I go to mysite.azurewebsites.com/wp-admin I get an error establishing a database connection: Site could not be found. Check if your database contains the following pages: wp_blogs, ..........
What could be the problem? Does this process just need a bit of time to propagate all the data?
EDIT: Something to note, which might be a factor: when I ran the last query, it also dropped/re-created the table "wp_users", so all previous data was wiped and replaced with the info from the previous live server.
Normally you will see any changes made immediately. But because your database is hosted on a geo-separated cluster using circular replication, there are some rare circumstances where this might not be true.
Specifically, if your delete/write went to one master and your read query went to another. Data propagation is normally immediate, but if one of the nodes is offline or the system is unusually busy, there can be a delay.

Backing up DynamoDB tables via Data Pipeline vs manually creating a JSON for DynamoDB

I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables another team uses/works on, not me. These backups need to happen once a week, and will only be used to restore the DynamoDB tables in disastrous situations (so hopefully never).
I saw that there is a way to do this by setting up a data pipeline, which I'm guessing can be scheduled to do the job once a week. However, it seems like this would keep the pipeline open and start incurring charges. So I was wondering whether there is a significant cost difference between backing the tables up via the pipeline and keeping it open, versus creating something like a PowerShell script, scheduled to run on an EC2 instance that already exists, which would manually create a JSON mapping file and upload it to S3.
Also, I guess another question is more of a practicality question: how difficult is it to back up DynamoDB tables to JSON format? It doesn't seem too hard, but I wasn't sure. Sorry if these questions are too general.
Are you working under the assumption that Data Pipeline keeps the server up forever? That is not the case.
For instance, if you have defined a Shell Activity, the server terminates after the activity completes. (You may manually set termination protection; Ref.)
Since you only run the pipeline once a week, the costs are not high.
If you run a cron job on an EC2 instance, that instance needs to be up when you want to run the backup, and that could be a point of failure.
Incidentally, Amazon provides a Data Pipeline sample showing how to export data from DynamoDB.
I just checked the pipeline cost page, and it says "For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month". So I think I'm safe.
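For reference, a minimal sketch of the script-based alternative mentioned in the question, i.e. scanning a table and uploading a JSON dump to S3 (the table name, bucket, and key prefix are hypothetical, and a full scan will consume read capacity on the table):

```java
import java.io.File;
import java.io.PrintWriter;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Rough sketch only; the table name, bucket, and key prefix are made up.
public class DynamoTableToS3 {

    public static void main(String[] args) throws Exception {
        DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Table table = dynamoDB.getTable("my-table");           // hypothetical table name

        // Dump every item as one JSON object per line. Note that a full scan
        // consumes read capacity on the table.
        File dump = File.createTempFile("my-table-backup", ".json");
        try (PrintWriter out = new PrintWriter(dump, "UTF-8")) {
            for (Item item : table.scan()) {
                out.println(item.toJSON());
            }
        }

        // Upload the dump; a timestamped key keeps the weekly backups separate.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.putObject("my-backup-bucket",                        // hypothetical bucket
                "dynamodb/my-table/" + System.currentTimeMillis() + ".json", dump);
    }
}
```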

How to perform targeted select queries on main DB instance when using Amazon MySQL RDS and Read replica?

I'm considering using Amazon MySQL RDS with Read Replicas. The only thing disturbing me is replica lag and eventual inconsistency. For example, imagine the case when a user modifies his profile (the UPDATE is performed on the main DB instance) and then refreshes the page to see the changed info (the SELECT might be served by a replica which has not received the changes yet due to replica lag).
By accident, I found an Amazon article which mentions it's possible to perform targeted queries. To me it sounds like we can add some parameter or hint to tell Amazon to execute a SELECT on the main DB instance instead of on a replica. The example with the user profile is quite trivial, but the same problem occurs in more realistic cases, for example checkout, where a user performs several steps and needs to see updated info on the next screens. Yes, the application could cache the entire data set on its own, but it would be great if anybody knows how to perform targeted queries on the main DB instance.
I read the link you referenced and didn't find any mention of "target" or anything like that.
But this line might be what you're referring to:
Otherwise, you should spread out the load and read from one of the Read Replicas. You can make this decision on a query-by-query basis within your application. You will probably want to maintain some sort of registry of available Read Replicas within your application, choosing from among them on a round-robin or randomly distributed basis.
If so, then I interpret that line to suggest that you can balance reads in your application by just picking one server from a pool and hitting that one. But it would all be in your application logic.
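A minimal sketch of that kind of application-side routing (the endpoints, credentials, and class name are hypothetical): reads that must see the user's latest writes go to the primary endpoint, and everything else is spread round-robin across the replica endpoints.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical application-level router; the endpoints are made-up examples.
public class ReadWriteRouter {

    private final String primaryUrl =
            "jdbc:mysql://mydb.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/app";
    private final List<String> replicaUrls = Arrays.asList(
            "jdbc:mysql://mydb-replica-1.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/app",
            "jdbc:mysql://mydb-replica-2.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/app");
    private final AtomicInteger next = new AtomicInteger();

    // Writes, and reads that must see the user's own latest writes (profile
    // page right after an update, checkout steps), go to the primary.
    public Connection primary() throws SQLException {
        return DriverManager.getConnection(primaryUrl, "app", "secret");
    }

    // Everything else is spread across the replicas round-robin, accepting
    // that replica lag means the data may be slightly stale.
    public Connection replica() throws SQLException {
        int i = Math.floorMod(next.getAndIncrement(), replicaUrls.size());
        return DriverManager.getConnection(replicaUrls.get(i), "app", "secret");
    }
}
```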

RDS Read Replica Considerations

We hired an intern and want to let him play around with our data to generate useful reports. Currently we just took a database snapshot and created a new RDS instance that we gave him access to. But that is out of date almost immediately due to changes on the production database.
What we'd like is a live (or close-to-live) mirror of our actual database that we can give him access to, without worrying about him modifying any real data or accidentally bringing down our production database (e.g. by running a silly query like SELECT (*) FROM ourbigtable or a really slow join).
Would a read replica be suitable for this purpose? It looks like it would at least be staying up to date but I'm not clear what would happen if a read replica went down or if data was accidentally changed on it or any other potential liabilities.
The only thing I could find related to this was this SO question and this has me a bit worried (emphasis mine):
If you're trying to pre-calculate a lot of data and otherwise modify what's on the read replica, you need to be really careful you're not changing data -- if the read is no longer consistent then you're in trouble :)
TL;DR Don't do it unless you really know what you're doing and you understand all the ramifications.
And bluntly, MySQL replication can be quirky in my experience, so even knowing what is supposed to happen and what does happen if there's a conflict as the master tries to write updated data to a slave you've also updated.... who knows.
Is there any risk to the production database if we let an intern have at it on an unreferenced read replica?
We've been running read-replicas of our production databases for a couple of years now without any significant issues. All of our sales, marketing, etc. people who need the ability to run queries are given access to the replica. It has worked quite well and has been stable for the most part. The production databases are locked down so that only our applications can connect to them, and the read-replicas are accessible only via SSL from our office. Setting up the security is pretty important, since you would be creating all the user accounts on the master database and they'd then get replicated to the read-replica.
I think we once saw a read-replica get into a bad state due to a hardware-related issue. The great thing about read-replicas though is that you can simply terminate one and create a new one any time you want/need to. As long as the new replica has the exact same instance name as the old one its DNS, etc. will remain unchanged, so aside from being briefly unavailable everything should be pretty much transparent to the end users. Once or twice we've also simply rebooted a stuck read-replica and it was able to eventually catch up on its own as well.
There's no way that data on the read-replica can be updated by any method other than processing commands sent from the master database. RDS simply won't allow you to run something like an insert, update, etc. on a read-replica no matter what permissions the user has. So you don't need to worry about data changing on the read-replica causing things to get out of sync with the master.
Occasionally the replica can get a bit behind the production database if somebody submits a long-running query, but it typically catches back up fairly quickly once the query completes. In all our production environments we have a few monitors set up to keep an eye on replication and to check for long-running queries. We use the pmp-check-mysql-replication-delay command from the Percona Toolkit for MySQL to watch replication; it's run every few minutes via Nagios. We also have a custom script, run via cron, that checks for long-running queries. It basically parses the output of the SHOW FULL PROCESSLIST command and sends out an e-mail if a query has been running for a long time, along with the username of the person running it and the command to kill the query if we decide we need to.
With those checks in place we've had very little problem with the read-replicas.
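As a rough idea of what the long-query check can look like (this is not the actual script; the endpoint, credentials, and 300-second threshold are made up), a small program run from cron could query the processlist and report anything long-running:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Rough sketch of a long-running-query check; the endpoint, credentials, and
// 300-second threshold are hypothetical.
public class LongQueryCheck {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://mydb.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/information_schema";
        try (Connection conn = DriverManager.getConnection(url, "monitor", "secret");
             Statement st = conn.createStatement();
             // Same information as SHOW FULL PROCESSLIST, but filterable in SQL.
             ResultSet rs = st.executeQuery(
                     "SELECT id, user, time, info FROM processlist " +
                     "WHERE command <> 'Sleep' AND time > 300")) {
            while (rs.next()) {
                // In the real check this would go into an alert e-mail instead.
                System.out.printf("user=%s running for %ds: %s%n  kill with: CALL mysql.rds_kill(%d);%n",
                        rs.getString("user"), rs.getLong("time"),
                        rs.getString("info"), rs.getLong("id"));
            }
        }
    }
}
```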
MySQL replication works in such a way that what happens on the slave has no effect on the master.
A replication slave asks for a history of events that happened on the master and applies them locally. The master never writes anything on the slaves: the slaves read from the master and do the writing themselves. If the slave fails to apply the events it read from the master, it will stop with an error.
The problematic part of this style of data replication is that if you modify the slave and later modify the master, you might end up with a different value on the slave than on the master. This can be avoided by turning on the global read_only variable on the slave.