Google Compute Engine disk snapshot stuck at creating - google-compute-engine

I took a snapshot of a 50GB volume (non-boot) which is attached to an instance. The snapshot was successful.
I shut down the instance and tried taking another snapshot of the same volume. This time the command hung. gcloud reports the status "CREATING" for this attempt, and it has been hours since I started the snapshot command. I tried the same from the Google Developers Console; the behaviour is the same.
I restarted the instance and the status of the snapshot changed to "READY" within seconds.
It seems that snapshots can only be taken while the volume is attached to a running instance; otherwise the command is queued and executed once the volume/instance is live. Is this expected behaviour?

I replicated your issue and indeed the snapshot process halts when the instance is shut down. You may have also noticed that now the Shutdown/Start feature has been introduced - it was not available before.
I believe this is due to how snapshots are handled on the platform. Your first snapshot creates a full copy of the disk, while the second one is differential. The differential one will fail or stay pending because, while the instance is down, it cannot query the source disk to check what has changed. You can check this for further info.
Detaching the disk from the instance and then creating the snapshot works, so that could be a workaround for you.
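A minimal sketch of that workaround, written as a small Python helper that only builds the gcloud command lines (it does not run them). The instance, disk, zone and snapshot names are placeholders, and the final re-attach step is my own addition, assuming you want the disk back on the instance afterwards.

```python
import shlex


def snapshot_detached_disk_commands(instance, disk, zone, snapshot_name):
    """Return the gcloud commands for detach -> snapshot -> re-attach, in order."""
    return [
        # 1. Detach the data disk from the (stopped) instance.
        ["gcloud", "compute", "instances", "detach-disk", instance,
         f"--disk={disk}", f"--zone={zone}"],
        # 2. Snapshot the now-detached disk.
        ["gcloud", "compute", "disks", "snapshot", disk,
         f"--snapshot-names={snapshot_name}", f"--zone={zone}"],
        # 3. Re-attach the disk (assumption: not part of the original answer).
        ["gcloud", "compute", "instances", "attach-disk", instance,
         f"--disk={disk}", f"--zone={zone}"],
    ]


# Placeholder names; print the commands so they can be reviewed before running.
for cmd in snapshot_detached_disk_commands(
        "my-instance", "my-data-disk", "us-central1-a", "my-data-disk-snap-2"):
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once you are happy with the names.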
Hope this helps

Related

Google compute engine, instance dead? How to reach?

I have a small instance running in GCE. I had some trouble with MongoDB, so after a few tries I decided to reset the instance. But it didn't seem to come back online, so I stopped the instance and restarted it.
It is a Bitnami MEAN stack which starts Apache and other services at startup.
But I can't reach the instance! No SCP, no SSH, no web service running. When I try to connect via SSH (in GCE) it times out; it can't make a connection on port 22. The information says 'The instance is booting up and sshd is not running yet', which is possible of course, but I can't reach the instance in any way, not even after waiting an hour :) I'm not sure what's happening if I can't connect to it somehow :(
There is some activity in the console: some CPU usage (mostly 0%) and some incoming traffic, but no outgoing.
I hope someone can give me a hint here!
Update 1
After the helpful tip from Serhii, I found this in the logs:
Booting from Hard Disk 0...
[ 0.872447] piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: Inodes that were part of a corrupted orphan linked list found.
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
fsck exited with status code 4
The root filesystem on /dev/sda1 requires a manual fsck
Update 2...
So, I need to fsck the drive...
I created a snapshot, made a new disk from that snapshot, and added the new disk as an extra disk to another instance. Now that instance won't boot either, with the same problem; removing the extra disk fixed it again. So adding the disk makes it crash even though it isn't the boot disk?
First, have a look at Compute Engine -> VM instances -> NAME_OF_YOUR_VM -> Logs -> Serial port 1 (console) and try to find errors and warnings that could be connected to a lack of free space or to SSH. It would be helpful if you updated your post with this information. If your instance has run out of free space, follow these instructions.
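If the serial log is long, a small filter can help spot the relevant lines. This is only a sketch: it assumes you have saved the console text yourself (for example from the Logs page, or with gcloud compute instances get-serial-port-output). The patterns are taken from the boot log quoted in the question, plus the usual "No space left on device" message as my own guess at what a full disk would print.

```python
import re

# Patterns to flag; extend this list for your own image and services.
SUSPICIOUS_PATTERNS = [
    re.compile(r"contains a file system with errors"),
    re.compile(r"UNEXPECTED INCONSISTENCY"),
    re.compile(r"fsck exited with status code \d+"),
    re.compile(r"requires a manual fsck"),
    re.compile(r"No space left on device"),  # assumption: typical full-disk message
]


def find_boot_problems(serial_output):
    """Return the lines of the serial console text that match a known pattern."""
    hits = []
    for line in serial_output.splitlines():
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS):
            hits.append(line.strip())
    return hits
```

Calling find_boot_problems(text) on the saved log returns only the matching lines, in their original order.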
You can try to connect to your VM via Serial console by following this guide, but keep in mind that:
The interactive serial console does not support IP-based access
restrictions such as IP whitelists. If you enable the interactive
serial console on an instance, clients can attempt to connect to that
instance from any IP address.
You can find more details in the documentation.
Have a look at the Troubleshooting SSH guide and Known issues for SSH in browser. In addition, Google provides a troubleshooting script for Compute Engine to identify issues with SSH login/accessibility of your Linux based instance.
If you still have a problem try to use your disk on a new instance.
EDIT It looks like your test VM is trying to boot from the disk that you created from the snapshot. Try to follow this guide.
If you still have a problem, you can try to recreate the boot disk from a snapshot to resize it.

AWS MySQL RDS instance becomes unresponsive and getting restarted automatically

We have an AWS MySQL RDS instance which is about 1.7 TB in size. Sometimes it becomes unresponsive and no operations can be performed.
CPU utilization, write IOPS, read IOPS, queue depth, write throughput, write latency and read latency drop to zero.
The number of connections piles up.
"Show engine innodb status" hangs.
Many queries issued by rdsadmin (around 25 of each) are hung:
SELECT count(*) from mysql.rds_replication_status WHERE action = 'reset slave' and master_host is NULL and master_port is NULL GROUP BY action_timestamp,called_by_user,action,mysql_version,master_host,master_port ORDER BY action_timestamp LIMIT 1;
SELECT NAME, VALUE FROM mysql.rds_configuration;
After some time, the instance gets rebooted automatically with the following error:
MySQL restart initiated to address MySQL induced log backup issues. Note that as part of this resulution, a DB Snapshot will be performed after MySQL completes restarting.
What can be the issue? This happens quite often. Sometimes, to our surprise, it happens at off-peak times too.
I faced the same issue and raised an issue with AWS Support. Got the following explanation:
The RDS Monitoring service discovered issue regarding backing up Binary Logs of your databases which is critical for Point in Time Restore (PITR) feature. To mitigate this issue and in order to avoid data corruption, RDS monitoring restarted the RDS instance and hence a restart was automatically triggered. In order to make sure that there is no data loss it took a snapshot of DB instance.
Although the RDS instance was Multi-AZ, it didn't fail over, for the following reason:
Multi AZ has 2 criteria:
1- Single Box Experience, which means that Customer always finds his data even after failover.
2- Higher Availability than Single AZ.
So both criteria have to be present when AWS monitoring service takes the Decision to failover to the standby instance, but in your case AWS monitoring service noticed some risk that can cause data loss after the failover and that is why it took the decision to reboot instead of failing over.
Hope this helps. This has happened to me 3 times in the last week, though.
Check your DB maintenance window, i.e. when your scheduled maintenance takes place, and note at what times this issue happens: is it at regular intervals or random?
Check both the MySQL error log and the slow query log.
If possible, paste the suspected entries here.
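To compare incident times against the maintenance window, something like the following could be used. It is a sketch that assumes the window string has the ddd:hh24:mi-ddd:hh24:mi form in UTC (for example sun:05:00-sun:06:00), which is how RDS normally reports it; the function names are my own.

```python
from datetime import datetime, timezone

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]  # matches datetime.weekday()


def _minute_of_week(day, hhmm):
    """Convert e.g. ('sun', '05:00') to minutes elapsed since Monday 00:00."""
    hours, minutes = hhmm.split(":")
    return DAYS.index(day.lower()) * 24 * 60 + int(hours) * 60 + int(minutes)


def in_maintenance_window(window, when_utc):
    """Return True if when_utc (a UTC datetime) falls inside the weekly window."""
    start_text, end_text = window.split("-")
    start = _minute_of_week(start_text[:3], start_text[4:])
    end = _minute_of_week(end_text[:3], end_text[4:])
    now = when_utc.weekday() * 24 * 60 + when_utc.hour * 60 + when_utc.minute
    if start <= end:
        return start <= now < end
    # The window wraps around the end of the week (e.g. sun 23:30 to mon 00:30).
    return now >= start or now < end


incident = datetime(2024, 1, 7, 5, 30, tzinfo=timezone.utc)  # a Sunday
print(in_maintenance_window("sun:05:00-sun:06:00", incident))
```

If most incidents fall outside the window, scheduled maintenance is probably not the trigger.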
We were able to resolve this issue by upgrading the instances to 5.6.34.

AWS MySQL RDS becomes unresponsive time to time

We have two MySQL RDS instances (Master and read replica). As usual we write to the master and read from the slave.
The master server works fine, but we have observed that the slave server becomes unresponsive from time to time.
Observations:
Monitoring Graphs
CPU utilization drops down to 0
Increase in number of connections
Write IOPS, read IOPS, queue depth, write throughput, write latency and read latency drop to 0.
This can be resolved with a restart, but we are interested in finding the root cause. When this happens, we can still log in to the mysql prompt, but we can't execute any queries. The AWS console shows the instance as healthy, and no errors are shown.
According to the graphs, there is no abnormal activity or increase in resource utilization just before this happens. Everything looks normal.
(The small climbs in the attached graphs are normal and in line with the business pattern. Historically the instance has survived larger peaks.)
Please let me know if you happen to come across such a situation.
Thanks.
Note:
Instance Information
db.m4.xlarge
IOPS 2000
Size 50G
Basically, the instance is underutilized when the issue happens.
Note:
If we wait without restarting the instance, it gets restarted automatically with the following error:
MySQL restart initiated to address MySQL induced log backup issues. Note that as part of this resulution, a DB Snapshot will be performed after MySQL completes restarting.

Can't delete google cloud sql replication master instance

I decided to play around with Google Cloud SQL, so I set up a test SQL instance, loaded it with some data and then set up replication on it in the Google dev console. I did my testing and found that it all works great: the master/slave setup works as it should and my little POC was a success. So now I want to delete the POC SQL instances, but that's not going so well.
I deleted the replica instance fine (aka the 'slave'), but for some reason the master instance still thinks there is a slave and therefore will not let me delete it. For example, when I run the following command in the gcloud shell:
gcloud sql instances delete MY-INSTANCE-NAME
I get the following message:
ERROR: (gcloud.sql.instances.delete) The requested operation is not valid for a replication master instance.
This screenshot also shows that in the google dev console it clearly thinks there are no replicas attached to this instance (because I deleted them) but when I run:
gcloud sql instances describe MY-INSTANCE-NAME
It shows that there is a replica name still attached to the instance.
Any ideas on how to delete this for good? Kinda lame to keep on paying for this when it was just a POC that I want to delete (glad I didn't pick a high memory machine!)
The issue was on Google's side and they fixed it. Here is the sequence of events that led to it:
1) Change master's tier
2) Promote replica to master while the master tier change is in progress
Just had the same problem using GCloud. Deleting the failover replica first and then the master instance worked for me.
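The "replicas first, master last" order can be sketched as a helper that works from the output of gcloud sql instances describe MY-INSTANCE-NAME --format=json. It assumes the attached replicas are listed under the replicaNames field of that output, and it only builds the delete commands rather than running them.

```python
def deletion_commands(described_instance):
    """Build delete commands: every listed replica first, the master last.

    described_instance is the dict parsed from
    `gcloud sql instances describe NAME --format=json`.
    """
    # Assumption: attached replicas appear under "replicaNames".
    replicas = described_instance.get("replicaNames", [])
    names = list(replicas) + [described_instance["name"]]
    return [["gcloud", "sql", "instances", "delete", name] for name in names]


# Placeholder data shaped like the describe output.
example = {"name": "poc-master", "replicaNames": ["poc-failover"]}
for cmd in deletion_commands(example):
    print(" ".join(cmd))
```

This would not help in the case from the first answer, where the master held a stale replica reference that Google had to clear.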

Snapshot of EBS volume used for replication

I set up an EC2 instance with MySQL on an EBS volume and set up another instance which acts as a slave for replication. The replication setup is fine. My question is about taking snapshots of these volumes. I noticed that the tables need to be locked during the snapshot process, which may inconvenience users. So my idea is to leave the master instance alone and take a snapshot of the instance acting as slave. Is this a good idea? Is there anyone out there with a similar setup who could guide me in the right direction?
Also, taking a snapshot of the slave instance would require locking tables. Would that mean replication will break?
Thanks in advance.
Though it's a good idea to lock the database and freeze the file system when you initiate the snapshot, the actual API call to initiate the snapshot takes a fraction of a second, so your database and file system aren't locked/frozen for long.
That said, there are a couple other considerations you did not mention:
When you attempt to create the lock on the database, it might need to wait for other statements to finish before the lock is granted. During this time, your pending lock might force further statements to wait until you get and release the lock. This can cause interruptions in the flow of statements on your production database.
After you initiate the creation of the snapshot, your application/database is free to use the file system on the volume, but if you have a lot of writes, you could experience high iowait, sometimes enough to create a noticeable slowdown of your application. The reason for this is that the background snapshot process needs to copy a block to S3 before it will allow a write to that block on the active volume.
I solve the first issue by requesting a lock and timing out if it is not granted quickly. I then wait a bit and keep retrying until I get the lock. Appropriate timeouts and retry delay may vary for different database loads.
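Roughly, the retry loop looks like the sketch below. The try_lock callable is hypothetical: it stands for whatever your database client does to request the lock with a timeout, returning True if the lock was granted in time. The default timeout, delay and attempt count are placeholders, not recommendations.

```python
import time


def acquire_with_retry(try_lock, lock_timeout=5.0, retry_delay=10.0,
                       max_attempts=6, sleep=time.sleep):
    """Call try_lock(lock_timeout) until it succeeds; return the attempt number.

    Giving up quickly on each attempt keeps a pending lock request from
    stalling other statements for long; the pause lets the backlog drain.
    """
    for attempt in range(1, max_attempts + 1):
        if try_lock(lock_timeout):
            return attempt
        if attempt < max_attempts:
            sleep(retry_delay)
    raise TimeoutError(f"lock not granted after {max_attempts} attempts")
```

The sleep parameter is only there so the loop can be exercised without real delays.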
I solve the second problem by performing the frequent, consistent snapshots on the slave instead of the master, just as you proposed. I still recommend performing occasional snapshots against the master simply to improve its intrinsic durability (a deep EBS property) but those snapshots do not need to be performed with locking or freezing as you aren't going to use them for backups.
I also recommend the use of a file system that supports flushing and freezing (XFS). Otherwise, you are snapshotting locked tables in MySQL that might not even have all their blocks on the EBS volume yet, or other parts of the file system might be modified and inconsistent in the snapshot.
If you're interested, I've published open source software that performs the best practices I've collected related to creating consistent EBS snapshots with MySQL and XFS (both optional).
http://alestic.com/2009/09/ec2-consistent-snapshot
To answer your last question, locking tables in the master will not break replication. In my snapshot software I also flush the tables with read lock to make sure that everything is on the disk being snapshotted and I add the keyword "LOCAL" so that the flush is not replicated to any potential slaves.
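For illustration, the flush-and-lock step can be wrapped so the lock is always released, even if starting the snapshot fails. This is a sketch, not the code from my tool: cursor is assumed to be any DB-API style cursor with an execute() method, and the lock only holds while that same connection stays open.

```python
from contextlib import contextmanager


@contextmanager
def global_read_lock(cursor):
    """Hold a global read lock for the duration of the with-block."""
    # LOCAL keeps the FLUSH statement out of the binary log.
    cursor.execute("FLUSH LOCAL TABLES WITH READ LOCK")
    try:
        yield
    finally:
        # Always release the lock, even if the snapshot call raised.
        cursor.execute("UNLOCK TABLES")
```

Inside the with-block you would freeze the file system and make the API call that initiates the snapshot, then leave the block as soon as the call returns.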
You can definitely take a snapshot of the slave.
From your description, it does not seem like the slave is being used operationally.
If this is the case, then the safest method of obtaining a reliable volume snapshot would be to:
1) Stop the MySQL server (mysqld) on the slave.
2) Start the snapshot (either through the AWS Console or from the command line).
3) When the snapshot is complete, restart mysqld on the slave server.
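The steps above can be sketched as a small wrapper whose main job is to make sure mysqld is started again even if the snapshot step fails. All four callables are hypothetical placeholders: you would supply functions that stop and start mysqld on your distribution and that wrap, for example, aws ec2 create-snapshot and aws ec2 wait snapshot-completed.

```python
def snapshot_with_mysql_stopped(stop_mysql, create_snapshot,
                                wait_until_complete, start_mysql):
    """Stop mysqld, snapshot the volume, wait, then restart mysqld.

    create_snapshot() must return an identifier that wait_until_complete()
    accepts. The identifier is returned to the caller.
    """
    stop_mysql()
    try:
        snapshot_id = create_snapshot()
        wait_until_complete(snapshot_id)
        return snapshot_id
    finally:
        # Restart the slave whether or not the snapshot succeeded, so that
        # replication can resume.
        start_mysql()
```

The slave should catch up with the master after the restart, provided the master still has the binary logs it needs.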