How does Google Compute Engine decide what instances to shut down when autoscaling? - google-compute-engine

I'm creating a managed instance group with autoscaling in GCE. When a lot of work is queued up, new instances are created and start doing work.
Let's say each chunk of work takes 10 minutes, could it happen that GCE decides to shut down an instance that still has work in progress?

Yes, that can happen: the autoscaler will terminate an instance as soon as its scale-in condition is met, and it has no knowledge of whether the instance still has work in progress.
However, you can use a shutdown script to control the termination. A shutdown script runs, on a best-effort basis, in the brief period between when the termination request is made and when the instance is actually terminated. During this period, Compute Engine attempts to run your shutdown script to perform any tasks you put in it. You can read more about how the autoscaler makes its decisions in the autoscaler documentation, and about shutdown scripts and their limitations in the shutdown scripts documentation.
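As a minimal sketch (the shutdown-script metadata key is real; the cleanup function is a hypothetical placeholder for your own logic), a shutdown script might look like this:
#!/bin/bash
# Best-effort cleanup before termination. Regular instances get roughly
# 90 seconds to run this, so keep it short: checkpoint the current work
# chunk or mark it as unclaimed so another instance can pick it up.
finish_or_release_current_chunk() {
  :  # hypothetical: flush state to Cloud Storage, requeue the work item, etc.
}
finish_or_release_current_chunk
You would attach it with something like:
gcloud compute instances add-metadata INSTANCE_NAME --metadata-from-file shutdown-script=shutdown.sh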
Also, if these instances are serving a backend service, it is a good idea to enable connection draining. You can enable connection draining on backend services to ensure minimal interruption to your users when an instance is deleted automatically by an autoscaler or manually removed from an instance group. The load balancing documentation has more on enabling connection draining.
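For example (the backend service name is a placeholder and the 60-second timeout is illustrative), draining can be enabled with:
gcloud compute backend-services update my-backend-service --global --connection-draining-timeout=60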

Related

GCE Windows Server gets auto shut down

My Windows Server instance on GCE is shut down from time to time. Based on the GCP logging, we can tell that failing the lateBootReportEvent check only triggers a reboot some of the time. I am wondering why?
[logs screenshot]
I am aware that the auto-shutdown is caused by integrity monitoring (settings shown below), and I understand that my boot integrity might fail here. I am just trying to understand why there is a "probability" here.
[Shielded VM settings screenshot]
Integrity monitoring and Shielded VM do not have any relation to a VM restart or shutdown.
Integrity monitoring only compares the most recent boot measurements to the integrity policy baseline and returns a pair of pass/fail results depending on whether they match or not, one for the early boot sequence and one for the late boot sequence.
Early boot is the boot sequence from the start of the UEFI firmware until it passes control to the bootloader. Late boot is the boot sequence from the bootloader until it passes control to the operating system kernel. If either part of the most recent boot sequence doesn't match the baseline, you get an integrity validation failure.
If the failure is expected, for example because you applied a system update on that VM instance, you should update the integrity policy baseline. If it is not expected, you should stop that VM instance and investigate the reason for the failure. Either way, the VM is never shut down by integrity monitoring.
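In the expected-failure case, the baseline can be re-learned from the current (trusted) boot state; a sketch, with VM_NAME and ZONE as placeholders:
gcloud compute instances update VM_NAME --zone=ZONE --shielded-learn-integrity-policy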
To determine what actually caused the VM to restart, you will need to look at the Windows event logs: review the Event Viewer logs for the instance at the time of the shutdown, then look up the shutdown reason against Microsoft's reason codes to determine what caused the VM to stop.
It is possible that the instance restarted to complete installation of updates, or encountered an internal error. However only the event viewer logs will determine the true cause.
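It can also help to confirm, from the GCP side, whether Compute Engine itself stopped or reset the VM rather than something inside Windows. A sketch (INSTANCE_NAME is a placeholder):
gcloud compute operations list --filter="targetLink:INSTANCE_NAME" --sort-by=~insertTime --limit=10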
If you find anything useful in the logs, please share it on this post so we can check.

Cloud SQL instance 2nd Generation ALTERNATIVE activation policy "ON DEMAND"

I have problem with Cloud SQL billing.
My Cloud SQL instance has used a full 720 hours of machine running time (db-g1-small, recently changed from db-n1-standard-1).
I've found, according to the Cloud SQL documentation, that:
For Second Generation instances, the activation policy is used only to start or stop the instance.
So without the ON_DEMAND policy of the First Generation, how can I reduce these costs on my Cloud SQL instance?
PS: It looks like my server never shuts down automatically because it keeps 4 sleeping connections open.
Indeed, for Second Generation instances of Cloud SQL, the only activation policies available are ALWAYS and NEVER, so it's no longer possible to leave that kind of instance handling entirely in Cloud SQL's hands.
However, you can work around this with a cron job that turns the instance on and off on a fixed schedule. For example, you can run a cron job on Friday night to shut the instance down and on Monday morning to turn it back on.
You can use the following command to do so:
gcloud sql instances patch [INSTANCE_NAME] --activation-policy [ACTIVATION_POLICY_VALUE]
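For example, a hypothetical crontab on a machine with gcloud installed and authenticated (the instance name and times are illustrative):
# Stop the instance on Friday at 22:00 and start it again on Monday at 07:00.
0 22 * * 5 gcloud sql instances patch my-instance --activation-policy NEVER --quiet
0 7 * * 1 gcloud sql instances patch my-instance --activation-policy ALWAYS --quiet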
Moreover, you can create a feature request on Google Cloud's Public Issue Tracker to re-include that functionality in Cloud SQL in the future, but there are no guarantees that this will happen.

Why do my google cloud compute instances always unexpectedly restart?

Help! Help! Help!
It is really annoying and I almost cannot bear it anymore! I'm using Google Cloud Compute Engine instances and they often restart unexpectedly, without any notification in advance. The restarts seem to happen randomly and I have no idea what's going wrong! I'm pretty sure the instances are busy (CPU usage > 50% and all GPUs in use) when the restarts happen. Could anyone please tell me how to solve this problem? Thanks in advance!
The issue is right here:
all GPUs are in use
If you check the official documentation about GPU:
GPU instances must terminate for host maintenance events, but can automatically restart. These maintenance events typically occur once per week, but can occur more frequently when necessary. You must configure your workloads to handle these maintenance events cleanly. Specifically, long-running workloads like machine learning and high-performance computing (HPC) must handle the interruption of host maintenance events. Learn how to handle host maintenance events on instances with GPUs.
This is because an instance with a GPU attached cannot be live-migrated to another host for maintenance, as happens with the rest of the virtual machines. To give the instance a physical GPU and bare-metal performance, GCE uses GPU passthrough, which unfortunately means that if the host has to go through maintenance, the VM goes down with it.
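One common pattern (a sketch; the metadata endpoint is real, but how you checkpoint is up to your workload) is to watch the maintenance-event metadata value from inside the VM and save your work when it changes:
# Returns NONE normally; switches to TERMINATE_ON_HOST_MAINTENANCE shortly
# before a GPU host maintenance event starts.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"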
This sounds like Preemptible VM instance.
Preemptible instances function like normal instances, but have the following limitations:
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
Compute Engine always terminates preemptible instances after they run for 24 hours.
To check whether your instance is preemptible using the gcloud CLI, just run:
gcloud compute instances describe instance-name --format="(scheduling.preemptible)"
Result
scheduling:
preemptible: false
change "instance-name" to real name.
Or simply via UI, click on compute instance and scroll down:
To check for system operations performed on your instance, review them with the following command:
gcloud compute operations list
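For example, to list only preemption events (this assumes the instance is in gcloud's current project):
gcloud compute operations list --filter="operationType=compute.instances.preempted"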

GCP HTTPS load balancing: when is it safe to delete instance?

TLDR: What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
Details: I have a relatively standard setup: GCE instances in a managed instance group, global HTTPS load balancer in front of them pointed at a backend service with only the one managed instance group in it. Health checks are standard 5 seconds timeout, 5 seconds unhealthy threshold, 2 consecutive failures, 2 consecutive successes.
I deploy some new instances, add them to the instance group, and remove the old ones. After many minutes (10-15 min usually), I delete the old instances.
Every once in a while, I notice that deleting the old instances (which I believe are no longer receiving traffic) correlates with a sporadic 502 response to a client, visible only in the load-balancer-level logs.
I've done a bunch of logs correlation and tcpdumping and load testing to be fairly confident that this 502 is not being served by one of the new, healthy instances. In any case, my question is:
What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
I think what you are looking for is the connection draining feature. https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining
To answer my own question: it turns out that these 502s were not related to shutting down an instance; 10 minutes was plenty of time to remove an instance from the serving path. The 502s were caused by a race condition between nginx's timeouts and the GCP HTTP(S) Load Balancer's timeouts. I've written up a full blog post on it here: Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer
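For reference, the usual fix is to make nginx's keepalive timeout longer than the load balancer's roughly 600-second backend keepalive, so that the load balancer, not nginx, is the side that closes idle connections. A sketch, assuming a default nginx layout that includes /etc/nginx/conf.d/*.conf inside the http block:
# Hypothetical drop-in config: 650s comfortably exceeds the LB's timeout.
cat >/etc/nginx/conf.d/keepalive.conf <<'EOF'
keepalive_timeout 650s;
EOF
nginx -s reload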

Google Compute Engine - Where is the STOPPED instance status?

Yesterday I tried to delete an instance by invoking the "halt" command through SSH. Unlike AWS, GCE does not allow us to choose the behavior of the VM shutdown, and it stops the instance by default (the instance status becomes TERMINATED).
Today I was browsing the Google Compute Engine REST API documentation and I found the following description :
status : [Output Only] The status of the instance. One of the following values: PROVISIONING, STAGING, RUNNING, STOPPING, STOPPED, TERMINATED.
What is this "STOPPPED" status ? Both the instances stopped through the Web console or the "halt" command have the "TERMINATED" status.
Any ideas ?
This STOPPED state is a new feature, added a few weeks ago, which you can reach via the Compute Engine API.
This method stops a running instance, shutting it down cleanly, and allows you to restart the instance at a later time. Stopped instances do not incur per-minute virtual machine usage charges while they are stopped, but any resources that the virtual machine is using, such as persistent disks and static IP addresses, will continue to be charged until they are deleted. For more information, see Stopping an instance.
I think this is similar to the AWS option you mention.
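These days the same behavior is also exposed through the CLI; for example (instance name and zone are placeholders):
gcloud compute instances stop my-instance --zone=us-central1-a
gcloud compute instances start my-instance --zone=us-central1-a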
For anyone stumbling on this question years later, a detailed lifecycle diagram of instances can be found here
There is no STOPPED status anymore; instances go from STOPPING to TERMINATED, whatever the stopping method is.
However, a new state that may be closer to what halt does has since been introduced: SUSPENDED. It's still in beta though, and it's not clear whether invoking halt would induce this state or simply terminate the instance.
See here for more details