GCE VM shuts down after having Shielded VM enabled and will not boot - google-compute-engine

We recently enabled Shielded VM on a node. The node restarted fine afterwards, but it has since shut down and won't boot. We see every boot sequence in the logs ending with a shutdown attributed to Shielded VM:
{
  insertId: "3"
  jsonPayload: {
    @type: "type.googleapis.com/cloud_integrity.IntegrityEvent"
    bootCounter: "4"
    shutdownEvent: {
    }
  }
  logName: "projects/project_name/logs/compute.googleapis.com%2Fshielded_vm_integrity"
  receiveTimestamp: "2021-03-26T09:30:41.307564027Z"
  resource: {
    labels: {…}
    type: "gce_instance"
  }
  severity: "NOTICE"
  timestamp: "2021-03-26T09:30:39.300465584Z"
}
However, there appears to be no integrity violation. We have tried disabling Shielded VM but still encounter the error. The event prior to shutdown is a late boot event:
insertId: "2"
jsonPayload: {
  lateBootReportEvent: {
    actualMeasurements: [4]
    policyEvaluationPassed: true
Is there any way to bypass the Shielded VM checks embedded in the boot sequence to get the node up?

I would suggest disabling all the Shielded VM options applied to the node, including integrity monitoring, through Cloud Shell or the Shielded VM tab of the VM, then trying to boot the node again. You can use the command:
gcloud beta compute instances update YOUR_NODE_NAME --no-shielded-integrity-monitoring --no-shielded-secure-boot --no-shielded-vtpm
You can find more info about the command and its flags in the gcloud reference documentation.
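Putting it together, a minimal sketch (the zone us-central1-a is an assumption; adjust it and the instance name to your setup, and note that Shielded VM options can only be changed while the instance is stopped):
# Stop the instance first; Shielded VM options cannot be updated on a running VM.
gcloud compute instances stop YOUR_NODE_NAME --zone us-central1-a
# Turn off integrity monitoring, Secure Boot, and vTPM.
gcloud beta compute instances update YOUR_NODE_NAME --zone us-central1-a \
    --no-shielded-integrity-monitoring --no-shielded-secure-boot --no-shielded-vtpm
# Start it again and watch the serial console for boot progress.
gcloud compute instances start YOUR_NODE_NAME --zone us-central1-a
gcloud compute instances get-serial-port-output YOUR_NODE_NAME --zone us-central1-a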

Related

Google Cloud Build windows builder error "Failed to get external IP address: Could not get external NAT IP from list"

I am trying to implement automatic deployments for my Windows Kubernetes container app. I'm following instructions from the Google's windows-builder, but the trigger quickly fails with this error at about 1.5 minutes in:
2021/12/16 19:30:06 Set ingress firewall rule successfully
2021/12/16 19:30:06 Failed to get external IP address: Could not get external NAT IP from list
ERROR
ERROR: build step 0 "gcr.io/[my-project-id]/windows-builder" failed: step exited with non-zero status: 1
The container, gcr.io/[my-project-id]/windows-builder, definitely exists, and it is located in the same GCP project as the Cloud Build trigger, just as the windows-builder documentation requires.
I structured my code based off of Google's docker-windows example. Here is my repository file structure:
repository
  cloudbuild.yaml
  builder.ps1
  worker
    Dockerfile
Here is my cloudbuild.yaml:
steps:
  # WORKER
  - name: 'gcr.io/[my-project-id]/windows-builder'
    args: [ '--command', 'powershell.exe -file build.ps1' ]
# OPTIONS
options:
  logging: CLOUD_LOGGING_ONLY
Here is my builder.ps1:
docker build -t gcr.io/[my-project-id]/test-worker ./worker;
if ($?) {
docker push gcr.io/[my-project-id]/test-worker;
}
Here is my Dockerfile:
FROM gcr.io/[my-project-id]/test-windows-node-base:onbuild
Does anybody know what I'm doing wrong here? Any help would be appreciated.
I replicated the steps from GitHub and got the same error. It throws the Failed to get external IP address... error because the external IP address of the VM is disabled by default in the builder's source code. I was able to build successfully by adding '--create-external-ip', 'true' in cloudbuild.yaml.
Here is my cloudbuild.yaml:
steps:
  - name: 'gcr.io/$PROJECT_ID/windows-builder'
    args: [ '--create-external-ip', 'true',
            '--command', 'powershell.exe -file build.ps1' ]

How to call a Google Cloud Function from Google Cloud Scheduler when ingressSettings = ALLOW_INTERNAL_ONLY?

I have, in the same project, one HTTP Cloud Function and a Cloud Scheduler, that sends a POST request to this function.
I want to allow only requests from within the project to call the Function. However, when I set Ingress Settings to "Allow internal traffic only", the Cloud Scheduler gets "PERMISSION_DENIED"
Here is the error log (edited)
{
  httpRequest: {
    status: 403
  }
  insertId: "insert_id"
  jsonPayload: {
    @type: "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished"
    jobName: "projects/project_name/locations/location/jobs/cloud_scheduler_job"
    status: "PERMISSION_DENIED"
    targetType: "HTTP"
    url: "https://location-project_name.cloudfunctions.net/cloud_function_name"
  }
  logName: "projects/project_name/logs/cloudscheduler.googleapis.com%2Fexecutions"
  receiveTimestamp: "2020-02-20T13:15:43.134508712Z"
  resource: {
    labels: {
      job_id: "cloud_scheduler_name"
      location: "location"
      project_id: "project_id"
    }
    type: "cloud_scheduler_job"
  }
  severity: "ERROR"
  timestamp: "2020-02-20T13:15:43.134508712Z"
}
Link to UI options for ingressSettings
According to the official documentation:
To use Cloud Scheduler your Cloud project must contain an App Engine app that is located in one of the supported regions. If your project does not have an App Engine app, you must create one.
Cloud Scheduler overview
Therefore, find the location of your App Engine application by running:
gcloud app describe
#check for the locationId: europe-west2
Then make sure that you deploy your Cloud Function, with its ingress settings set to "Allow internal traffic only", in the same region as your App Engine application.
I deployed a Cloud Function in the same region as my App Engine application and everything worked as expected.
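For reference, a minimal deployment sketch (the runtime and region here are assumptions; match the region to the locationId reported by gcloud app describe):
# Deploy an HTTP function with internal-only ingress, in the same region as the App Engine app.
gcloud functions deploy cloud_function_name \
    --runtime python37 \
    --trigger-http \
    --region europe-west2 \
    --ingress-settings internal-only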
When you use the option "Allow internal traffic only" you need to use some kind of authentication within your Cloud Functions (to avoid this you can use the option "Allow all traffic").
Please check the third comment provided in the link: https://serverfault.com/questions/1000987/trigger-google-cloud-functions-from-google-cloud-scheduler-with-private-network
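For the authentication part, one common approach is to have Cloud Scheduler attach an OIDC identity token to its requests. A hedged sketch (the service account and schedule are placeholders; the account needs the Cloud Functions Invoker role):
# Create an HTTP job that authenticates with an OIDC token minted for the given service account.
gcloud scheduler jobs create http cloud_scheduler_job \
    --schedule "0 * * * *" \
    --uri "https://location-project_name.cloudfunctions.net/cloud_function_name" \
    --http-method POST \
    --oidc-service-account-email invoker@project_name.iam.gserviceaccount.com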

Instance doesn't boot correctly, hangs on - "a start job is running for LSB: Raise network interface.."

My VM was shut down due to the end of my trial. However, I have since made payment and started other instances.
The GCE UI shows this system as successfully booted; however, the serial port output shows the following (see image).
Any ideas how to fix this?
Screenshot of Boot Error:
[ 6.895575] ppdev: user-space parallel port driver
[ 6.951588] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 6.993046] AVX version of gcm_enc/dec engaged.
[ 6.996351] alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
[ 7.001659] alg: No test for crc32 (crc32-pclmul)
[ OK ] Started LSB: start firewall.
[***] A start job is running for LSB: Raise network interf...17s / no limit)

GCE - No stackdriver memory metrics for nodes

I have set up my Kubernetes 1.3.4 cluster on GCE with
export KUBE_ENABLE_CLUSTER_MONITORING=google
This works quite nicely: I get application logs (for some reason under the Container Engine section, oddly) and also pod and node metrics.
The only thing missing is the node memory metrics; only CPU is shown (see screenshot).
No memory metrics
In the heapster logs I see tons of lines like this
{
  metadata: {
    severity: "ERROR"
    projectId: "<project-id>"
    serviceName: "container.googleapis.com"
    zone: "europe-west1-d"
    labels: {
      container.googleapis.com/cluster_name: "production"
      compute.googleapis.com/resource_type: "instance"
      compute.googleapis.com/resource_name: "fluentd-cloud-logging-production-minion-group-p0w8"
      container.googleapis.com/instance_id: "6772154497331326454"
      container.googleapis.com/pod_name: "heapster-v1.1.0-2102007506-23b3e"
      compute.googleapis.com/resource_id: "6772154497331326454"
      container.googleapis.com/stream: "stderr"
      container.googleapis.com/namespace_name: "kube-system"
      container.googleapis.com/container_name: "heapster"
    }
    timestamp: "2016-09-13T14:40:08.000Z"
    projectNumber: "930564692351"
  }
  textPayload: "E0913 14:40:08.665035 1 gcm.go:179] Error while sending request to GCM googleapi: Error 400: Timeseries 76, point: start is not older than end, for a cumulative metric, invalidParameter
"
  insertId: "pt5bo7g132r266"
  log: "heapster"
}
Not sure if this is related.
Any ideas?
If you are running your cluster on GCE instead of GKE, you should install the Stackdriver monitoring agent and verify the credentials the agent is using to communicate with Stackdriver.
If you are using Linux, you can install the agent by executing:
curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh
and you can check your credentials by running the following commands:
sudo cat $GOOGLE_APPLICATION_CREDENTIALS
sudo cat /etc/google/auth/application_default_credentials.json
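Once installed, a quick sanity check that the agent is actually running and not logging write errors (the service name stackdriver-agent is the legacy monitoring agent's; the log location may vary by distribution):
# Verify the monitoring agent service is up.
sudo service stackdriver-agent status
# Look for recent collectd/agent errors, such as failed writes to the monitoring API.
sudo grep -i collectd /var/log/syslog | tail -n 20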

Startup script from Bitbucket (https) fails to download, but works if instance is reset

I am programmatically launching a new instance using the Compute Engine API for Go [1] and a tool I made called vmproxy [2].
The problem I have is that if I launch a preemptible VM using a startup-script-url pointing to https://bitbucket.org/ronoaldo/debian-custom/raw/tip/tools/autobuild, the build script fails to download. I can see in the serial console output that the startup script metadata is there, and that it attempts to download with curl, but that part fails.
However, if I reset the instance via the Developers Console, the script downloads properly and runs nicely.
The code I am using to setup the instance is:
// Ronolinux is a VM proxy that runs a live systems build on Compute Engine.
var (
	Ronolinux = &vmproxy.VM{
		Path: "/",
		Instance: vmproxy.Instance{
			Name:        "ronolinux-buildd",
			Zone:        "us-central1-f",
			Image:       vmproxy.ResourcePrefix + "/debian-cloud/global/images/debian-8-jessie-v20150915",
			MachineType: "n1-standard-1",
			Metadata: map[string]string{
				"startup-script-url": "https://bitbucket.org/ronoaldo/debian-custom/raw/tip/tools/autobuild",
				"shutdown-script": `#!/bin/bash
gsutil cp /var/log/startupscript.log gs://ronoaldo/ronolinux/build-$(date +%Y%m%d%H%M%S).log
`,
			},
			Scopes: []string{storageReadWrite},
		},
	}
)
[1] https://godoc.org/google.golang.org/api/compute/v1
[2] https://godoc.org/ronoaldo.gopkg.net/aetools/vmproxy
If your startup script is not hosted on Cloud Storage, there is a random chance the download will fail. If you look at the serial console output, make sure to scroll horizontally, as it will not wrap long lines. In my case, the error line was very long, and this hid the real end of the message:
(... long curl one-line progress output ...)
curl: (7) Failed to connect to bitbucket.org port 443: Connection timed out
(...)
Your host must respond within a 10s timeout. In my case, the first boot usually failed to contact Bitbucket, hence failing to download the script; a VM reset also made things work, as the network latency outside Google Cloud was probably better by then.
I ended up hosting the script on Cloud Storage to avoid these issues.
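A minimal sketch of that move (the bucket name is a placeholder; startup-script-url accepts gs:// URLs, which are fetched over Google's own network):
# Upload the script to Cloud Storage.
gsutil cp tools/autobuild gs://my-bucket/autobuild
# Point the instance metadata at the Cloud Storage copy instead of Bitbucket.
gcloud compute instances add-metadata ronolinux-buildd \
    --zone us-central1-f \
    --metadata startup-script-url=gs://my-bucket/autobuild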