GCE Windows Server gets auto shut down - google-compute-engine

My Windows Server instance on GCE shuts down from time to time. From the GCP logs, it looks like failing the lateBootReportEvent check triggers a reboot only some of the time. I am wondering why?
logs screenshot
I am aware that the auto-shutdown is caused by integrity monitoring (settings shown below), and I understand that my boot integrity check might be failing here. I am just trying to understand why there is a "probability" involved.
Shielded-VM settings

Integrity monitoring and Shielded VM have no relation to a VM restart or shutdown.
Integrity monitoring only compares the most recent boot measurements to the integrity policy baseline and returns a pair of pass/fail results depending on whether they match or not, one for the early boot sequence and one for the late boot sequence.
Early boot is the boot sequence from the start of the UEFI firmware until it passes control to the bootloader. Late boot is the boot sequence from the bootloader until it passes control to the operating system kernel. If either part of the most recent boot sequence doesn't match the baseline, you get an integrity validation failure.
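As a rough illustration of the comparison described above, here is a toy model that checks each half of a boot sequence against a baseline and returns the pair of results. The `BootMeasurements` type and the digest strings are hypothetical and do not reflect how Compute Engine stores measurements internally. Note that the function only reports results; it takes no action on the VM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootMeasurements:
    # Hypothetical digests standing in for the real boot measurements.
    early_boot: str  # UEFI firmware start up to the hand-off to the bootloader
    late_boot: str   # bootloader up to the hand-off to the OS kernel


def validate(latest: BootMeasurements, baseline: BootMeasurements) -> dict:
    """Return one pass/fail result per boot phase (True means it matches)."""
    return {
        "earlyBootReportEvent": latest.early_boot == baseline.early_boot,
        "lateBootReportEvent": latest.late_boot == baseline.late_boot,
    }


# Example: something changed in the late boot sequence, e.g. after an update.
baseline = BootMeasurements(early_boot="aaa", late_boot="bbb")
after_update = BootMeasurements(early_boot="aaa", late_boot="ccc")
result = validate(after_update, baseline)
print(result)
```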
If the failure is expected, for example because you applied a system update on that VM instance, you should update the integrity policy baseline. If it is not expected, you should stop that VM instance and investigate the reason for the failure. Either way, integrity monitoring itself never shuts the VM down.
To determine what actually caused the VM to restart, review the Windows Event Viewer logs on the instance around the time of the shutdown, then check the recorded shutdown reason against Microsoft's reason codes.
It is possible that the instance restarted to complete the installation of updates, or that it encountered an internal error. However, only the Event Viewer logs can establish the true cause.
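When going through the System log, a handful of event IDs are usually the first place to look. The sketch below filters a list of exported events down to those IDs and attaches a short description. The sample timestamps are made up, and the `(timestamp, event_id)` tuple format is an assumption about how you export the log, not something Windows or GCE produces in this shape.

```python
# Event IDs commonly consulted in the Windows System log when tracing a shutdown.
SHUTDOWN_EVENT_IDS = {
    41: "Kernel-Power: the system rebooted without shutting down cleanly",
    1074: "User32: a process or user initiated the shutdown or restart",
    6006: "EventLog: the Event Log service stopped (clean shutdown)",
    6008: "EventLog: the previous system shutdown was unexpected",
}


def explain(events):
    """Keep only shutdown-related events, oldest first, with a description.

    `events` is an iterable of (timestamp, event_id) tuples (assumed format).
    """
    return [
        (timestamp, event_id, SHUTDOWN_EVENT_IDS[event_id])
        for timestamp, event_id in sorted(events)
        if event_id in SHUTDOWN_EVENT_IDS
    ]


# Hypothetical sample data, not taken from a real instance.
sample = [
    ("2021-01-01T10:05:00", 7036),
    ("2021-01-01T10:00:00", 1074),
    ("2021-01-01T10:00:30", 6006),
]
for row in explain(sample):
    print(row)
```

An event 1074 entry also records which process requested the shutdown and the reason code, which is what you would compare against Microsoft's list.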
If you find any useful entries in those logs, please share them in this post so we can take a look.

Related

How does Google Compute Engine decide what instances to shut down when autoscaling?

I'm creating a managed instance group with autoscaling in GCE. When a lot of work is queued up new instances will be created which start doing work.
Let's say each chunk of work takes 10 minutes, could it happen that GCE decides to shut down an instance that still has work in progress?
The autoscaler will terminate an instance immediately once the health check condition is met.
However, you can use a shutdown script to control the termination. A shutdown script will run, on a best-effort basis, in the brief period between when the termination request is made and when the instance is actually terminated. During this period, Compute Engine will attempt to run your shutdown script to perform any tasks you provide in the script. You can read more about the autoscaler's decisions in this document. You can read about using shutdown scripts and their limitations at this link.
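Because the shutdown window is brief, a 10-minute chunk may not finish inside it. One possible pattern, sketched here, is for the worker to stop picking up new chunks once a stop flag is set and to hand back whatever it has not started so it can be requeued. How the shutdown script sets the flag (a signal, a file, an HTTP call) is left open; `stop_requested`, `run_worker`, and `process` are all hypothetical names.

```python
import threading
from collections import deque

# Assumed to be set by whatever mechanism the shutdown script triggers.
stop_requested = threading.Event()


def run_worker(queue, process):
    """Process chunks until the queue is empty or a stop is requested.

    Returns (done, leftover): the chunks that were processed and the chunks
    that were never started and should be requeued elsewhere.
    """
    done = []
    while queue and not stop_requested.is_set():
        chunk = queue.popleft()
        process(chunk)
        done.append(chunk)
    return done, list(queue)


def process(chunk):
    # Simulate a termination request arriving while chunk "b" is in progress.
    if chunk == "b":
        stop_requested.set()


done, leftover = run_worker(deque(["a", "b", "c"]), process)
print(done, leftover)
```

This only protects work that has not started. A chunk that is already in progress still has to finish, or be made safe to retry, within the shutdown period.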
Also, if these instances serve a backend service, it is a good idea to enable connection draining. Connection draining on backend services ensures minimal interruption to your users when an instance is deleted automatically by the autoscaler or removed manually from an instance group. You can find more about enabling connection draining at this link.

Why is spark filling the tmp (spark.local.dir) in the machine that submits jobs?

I have a spark 1.2.1 cluster set up in standalone mode with a master and a few slaves. I then let my data scientists enjoy the cluster's power.
All is working fine. However, the dedicated server that my data scientists use to submit spark jobs has its spark.local.dir gradually filling up.
Given that this machine is sitting outside of the cluster, not a master, nor a worker/slave, I wouldn't think that the local spark.local.dir is used in any way by spark. (And why would it? It only shows the logs.)
I could not find good documentation covering this. Does anybody have an idea?
Not enough information about your setup to be sure, but I am guessing that the jobs are launched in client mode where the driver would be on your client node.
From the spark docs:
In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
I am guessing that in client mode the driver of the application (running on your client machine) needs plenty of scratch space to manage the other workers.
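If that guess is right, two options follow: submit in cluster mode so the driver runs inside the cluster, or keep client mode and point `spark.local.dir` at a larger directory on the submitting machine. The sketch below only builds the `spark-submit` argument list for either case. The master URL, paths, and class name are placeholders, and whether cluster mode is available depends on your cluster manager and on the application's language.

```python
def spark_submit_argv(app, master, deploy_mode="client",
                      local_dir=None, main_class=None):
    """Build a spark-submit command line as a list of arguments."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    argv = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if main_class is not None:
        argv += ["--class", main_class]
    if local_dir is not None:
        # Scratch directory used by Spark on the machine running this process.
        argv += ["--conf", f"spark.local.dir={local_dir}"]
    argv.append(app)
    return argv


# Placeholder values for illustration only.
argv = spark_submit_argv(
    "app.jar",
    "spark://master-host:7077",
    deploy_mode="cluster",
    local_dir="/data/spark-tmp",
    main_class="com.example.Main",
)
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` on a machine where `spark-submit` is on the PATH.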

Google Compute Engine - Where is the STOPPED instance status?

Yesterday I tried to delete an instance by invoking the "halt" command through SSH. Unlike AWS, GCE does not let us choose the behavior on VM shutdown and stops the instance by default (the instance status is TERMINATED).
Today I was browsing the Google Compute Engine REST API documentation and I found the following description :
status : [Output Only] The status of the instance. One of the following values: PROVISIONING, STAGING, RUNNING, STOPPING, STOPPED, TERMINATED.
What is this "STOPPED" status? Both the instances stopped through the Web console and those stopped with the "halt" command have the "TERMINATED" status.
Any ideas ?
This STOPPED state is a new feature added a few weeks ago which you can reach via the compute engine API.
This method stops a running instance, shutting it down cleanly, and allows you to restart the instance at a later time. Stopped instances do not incur per-minute, virtual machine usage charges while they are stopped, but any resources that the virtual machine is using, such as persistent disks and static IP addresses, will continue to be charged until they are deleted. For more information, see Stopping an instance.
I think this is similar to the AWS option you mention.
For anyone stumbling on this question years later, a detailed lifecycle diagram of instances can be found here
There is no STOPPED status anymore, instances are going from STOPPING to TERMINATED, whatever the stopping method is.
However, a new state that may be closer to what halt does has been introduced since: SUSPENDED. It's still in beta though, and I'm not sure whether invoking halt would induce this state or simply terminate the instance.
See here for more details

Instance failing to boot from disk on Google Compute Engine

I'm trying to boot an instance on GCE through libcloud.
When I boot through the libcloud function, ex_create_multiple_nodes (with 1 machine specified), the instance and the disk are created successfully, and the disk is attached. I verify this through the developer console. No exceptions are thrown by the function call.
Unfortunately, the instance never boots successfully:
...
Booting from Hard Disk...
Boot failed: not a bootable disk
...
Full log: https://gist.github.com/danwinkler/dcf1351675eb8c744220
(This repeats again and again)
I've tested booting with the same parameters (snapshot, zone, size, etc.) through the developers console and it works fine.
A colleague pointed out that the error looks similar to those caused by virt-manager, but I don't see anything related to that in the docs or the console Link.
Thanks!
This error normally happens when you're trying to boot from an empty disk. You can attach the disk to another VM instance and check the disk contents to ensure that it has a valid and bootable partition.
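Once the disk is attached to another VM, a quick first check is whether its first sector ends with the 0x55 0xAA boot signature, which an empty (all-zero) disk will not have. This is a minimal sketch: the device path in the comment is hypothetical, reading a raw device normally requires root, and a present signature is necessary but does not by itself prove the disk is bootable.

```python
SECTOR_SIZE = 512


def has_boot_signature(first_sector: bytes) -> bool:
    """True if the sector ends with the 0x55 0xAA MBR boot signature."""
    return (
        len(first_sector) >= SECTOR_SIZE
        and first_sector[510:512] == b"\x55\xaa"
    )


def read_first_sector(device_path: str) -> bytes:
    """Read the first 512 bytes of a disk or image file."""
    with open(device_path, "rb") as disk:
        return disk.read(SECTOR_SIZE)


# In-memory examples; on a real VM you might call
# has_boot_signature(read_first_sector("/dev/sdb"))  # hypothetical device path
empty_disk = bytes(SECTOR_SIZE)
signed_disk = bytes(510) + b"\x55\xaa"
print(has_boot_signature(empty_disk), has_boot_signature(signed_disk))
```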

Is it possible to patch clustered SQL Server without a BizTalk outage?

We have a BizTalk Server install backed by a clustered SQL environment for high availability.
However, whenever the SQL environment is patched there is a momentary outage as part of node failover. Consequently the host instances stop and BizTalk shuts down (if we move to CU2 the host instances will automatically restart, but this is a separate issue).
This is undesirable, as it prevents incoming web requests and breaks open web service clients. As such, is there a strategy for gracefully patching SQL Server without a BizTalk outage?
It seems this is impossible. Marking this as the accepted answer until someone can pleasantly surprise me otherwise.