Google Compute Engine - Where is the STOPPED instance status? - google-compute-engine

Yesterday I tried to delete an Instance by invoking the "halt" command through SSH. Unlike AWS, GCE does not allow us to choose the behavior of the VM shutdown and stop the instance by default (the instance status is TERMINATED).
Today I was browsing the Google Compute Engine REST API documentation and I found the following description :
status : [Output Only] The status of the instance. One of the following values: PROVISIONING, STAGING, RUNNING, STOPPING, STOPPED, TERMINATED.
What is this "STOPPPED" status ? Both the instances stopped through the Web console or the "halt" command have the "TERMINATED" status.
Any ideas ?

This STOPPED state is a new feature added a few weeks ago which you can reach via the compute engine API.
This method stops a running instance, shutting it down cleanly, and allows you to restart the instance at a later time. Stopped instances do not incur per-minute, virtual machine usage charges while they are stopped, but any resources that the virtual machine is using, such as persistent disks and static IP addresses,will continue to be charged until they are deleted. For more information, see Stopping an instance.
I think this is similar to the AWS option you mention.

For anyone stumbling on this question years later, a detailed lifecycle diagram of instances can be found here
There is no STOPPED status anymore, instances are going from STOPPING to TERMINATED, whatever the stopping method is.
However a new state, that may be closer to what halt does, has been introduced since: SUSPENDED. It's still in beta though, and not sure that invoking halt would induce this state or simply terminates the instance.
See here for more details

Related

GCE Windows Server gets auto shut down

My Windows Server instance on GCE is shut down from time to time. Based on the GCP logging, we can tell that fail to pass the lateBootReportEvent check only triggers a reboot by a certain chance. I am wondering why?
logs screenshot
I am aware that auto-shutdown is caused by integrity monitoring (settings shown below). And I understand that my boot integrity might fail here. I am just trying to understand why there is a "probability" here
Shielded-VM settings
The integrity monitor and shielded VMs don't have any relation with a VM restart or shutdown.
Integrity monitoring only compares the most recent boot measurements to the integrity policy baseline and returns a pair of pass/fail results depending on whether they match or not, one for the early boot sequence and one for the late boot sequence.
Early boot is the boot sequence from the start of the UEFI firmware until it passes control to the bootloader. Late boot is the boot sequence from the bootloader until it passes control to the operating system kernel. If either part of the most recent boot sequence doesn't match the baseline, you get an integrity validation failure.
If the failure is expected, for example if you applied some system update on that VM instance, you should update the integrity policy baseline. If it is not expected, you should stop that VM instance and investigate the reason for the failure, but the VM never be shutdown by integrity monitor .
In order to determine what actually caused the VM to restart you will need to look at the internal Windows event manager logs, and review the event viewer logs for the instance at time to shutdown, then reference the shutdown reason against Microsoft's reason codes to determine what caused the VM stop.
It is possible that the instance restarted to complete installation of updates, or encountered an internal error. However only the event viewer logs will determine the true cause.
If you found a useful internal logs please share on this post to check.

Unable to connect with Google Compute Engine

My GCP Compute engine is down. In GCP console it is up and running. It is an Ubuntu 18.04 server with 0.6GB memory(an always free tier compute engine). It was not restarted for more than couple of months. The system usage was around 60% last checked. I have already checked this answer. And none of them seems valid for me(seems so).
Free disk stands at 54%.
SSH perfectly configured.
No firewall issue.
It just seemed the VM stopped responding putting the hosted url down. When I checked Compute engine monitoring tabs, all the graphs were normal without any visible changes. I even checked the logging but no system crash kind of logs were present. I stopped and restarted the compute engine, and it started working perfectly, as nothing happened. In AWS, the VM instance failed System Reachability tests in such scenarios.
Does GCP has something similar like AWS system reachability test?
Any possible logs or something, by which I can understand the reason why the Compute Engine stopped responding?
There are different kinds of test with custom scenarios in your project , please find it on reference link [1]
[1] https://cloud.google.com/network-intelligence-center/docs/connectivity-tests/how-to/running-connectivity-tests

How does Google Compute Engine decide what instances to shut down when autoscaling?

I'm creating a managed instance group with autoscaling in GCE. When a lot of work is queued up new instances will be created which start doing work.
Let's say each chunk of work takes 10 minutes, could it happen that GCE decides to shut down an instance that still has work in progress?
Autoscaler will immediately terminate instance if the health check condition meets.
However, you can use a shutdown script to control the termination. A shutdown script will run, on a best-effort basis, in the brief period between when the termination request is made and when the instance is actually terminated. During this period, Compute Engine will attempt to run your shutdown script to perform any tasks you provide in the script. You can read more about the autoscaler decision in this document. You can read about using shutdown script and its limitation at this link.
Also if these instances are offering backend service then it is good to enable connection draining. You can enable connection draining on backend services to ensure minimal interruption to your users when an instance is deleted automatically by an autoscaler or manually removed from an instance group. You can find more at this link about enabling connection draining.

GCP HTTPS load balancing: when is it safe to delete instance?

TLDR: What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
Details: I have a relatively standard setup: GCE instances in a managed instance group, global HTTPS load balancer in front of them pointed at a backend service with only the one managed instance group in it. Health checks are standard 5 seconds timeout, 5 seconds unhealthy threshold, 2 consecutive failures, 2 consecutive successes.
I deploy some new instances, add them to the instance group, and remove the old ones. After many minutes (10-15 min usually), I delete the old instances.
Every once in a while, I notice that deleting the old instances (which I believe are no longer receiving traffic) correlates with a sporadic 502 response to a client, which can be seen only in the load-balancer level logs:
I've done a bunch of logs correlation and tcpdumping and load testing to be fairly confident that this 502 is not being served by one of the new, healthy instances. In any case, my question is:
What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
I think what you are looking for is the connection draining feature. https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining
To answer my own question: it turns out that these 502s were not related to shutting down an instance, 10 minutes was plenty of time to remove an instance from the serving path. The 502s were caused by a race condition between nginx timeouts and GCP's HTTP(S) Load Balancer timeouts—I've written up a full blog post on it here: Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer

Instance failing to boot from disk on Google Compute Engine

I'm trying to boot an instance on GCE through libcloud.
When I boot through the libcloud function, ex_create_multiple_nodes (with 1 machine specified), the instance and the disk are created successfully, and the disk is attached. I verify this through the developer console. No exceptions are thrown by the function call.
Unfortunately, the instance never boots successfully:
...
Booting from Hard Disk...
Boot failed: not a bootable disk
...
Full log: https://gist.github.com/danwinkler/dcf1351675eb8c744220
(This repeats again and again)
I've tested booting with the same parameters (snapshot, zone, size, etc.) through the developers console and it works fine.
A colleague pointed out that the error looks similar to those caused by virt-manager, but I don't see anything related to that in the docs or the console Link.
Thanks!
This error normally happens when you're trying to boot from an empty disk. You can attach the disk to another VM instance and check the disk contents to ensure that is has a valid and bootable partition.