GCE - No stackdriver memory metrics for nodes - google-compute-engine

I have set up my Kubernetes 1.3.4 cluster on GCE with
export KUBE_ENABLE_CLUSTER_MONITORING=google
This works quite nicely, I get application logs (for some reason in the Container Engine section, but well) and also pod and node metrics.
The only thing that is missing are the node memory metrics, only CPU is shown (see screenshot)
No memory metrics
In the heapster logs I see tons of lines like this
{
metadata: {
severity: "ERROR"
projectId: "<project-id>"
serviceName: "container.googleapis.com"
zone: "europe-west1-d"
labels: {
container.googleapis.com/cluster_name: "production"
compute.googleapis.com/resource_type: "instance"
compute.googleapis.com/resource_name: "fluentd-cloud-logging-production-minion-group-p0w8"
container.googleapis.com/instance_id: "6772154497331326454"
container.googleapis.com/pod_name: "heapster-v1.1.0-2102007506-23b3e"
compute.googleapis.com/resource_id: "6772154497331326454"
container.googleapis.com/stream: "stderr"
container.googleapis.com/namespace_name: "kube-system"
container.googleapis.com/container_name: "heapster"
}
timestamp: "2016-09-13T14:40:08.000Z"
projectNumber: "930564692351"
}
textPayload: "E0913 14:40:08.665035 1 gcm.go:179] Error while sending request to GCM googleapi: Error 400: Timeseries 76, point: start is not older than end, for a cumulative metric, invalidParameter
"
insertId: "pt5bo7g132r266"
log: "heapster"
}
Not sure if this is related.
Any ideas?

If you are running your cluster using GCE instead of GKE
You should install the stackdriver agent and verify the credentials that agent is using to communicate with stackdriver link
If you are using linux you can install the agent by executing:
curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh
and you can check your credentials running the following command:
sudo cat $GOOGLE_APPLICATION_CREDENTIALS
sudo cat /etc/google/auth/application_default_credentials.json

Related

Google Cloud Build windows builder error "Failed to get external IP address: Could not get external NAT IP from list"

I am trying to implement automatic deployments for my Windows Kubernetes container app. I'm following instructions from the Google's windows-builder, but the trigger quickly fails with this error at about 1.5 minutes in:
2021/12/16 19:30:06 Set ingress firewall rule successfully
2021/12/16 19:30:06 Failed to get external IP address: Could not get external NAT IP from list
ERROR
ERROR: build step 0 "gcr.io/[my-project-id]/windows-builder" failed: step exited with non-zero status: 1
The container, gcr.io/[my-project-id]/windows-builder, definitely exists and it's located in the same GCP project as the Cloud Build trigger just as the windows-builder documentation commanded.
I structured my code based off of Google's docker-windows example. Here is my repository file structure:
repository
cloudbuild.yaml
builder.ps1
worker
Dockerfile
Here is my cloudbuild.yaml:
steps:
# WORKER
- name: 'gcr.io/[my-project-id]/windows-builder'
args: [ '--command', 'powershell.exe -file build.ps1' ]
# OPTIONS
options:
logging: CLOUD_LOGGING_ONLY
Here is my builder.ps1:
docker build -t gcr.io/[my-project-id]/test-worker ./worker;
if ($?) {
docker push gcr.io/[my-project-id]/test-worker;
}
Here is my Dockerfile:
FROM gcr.io/[my-project-id]/test-windows-node-base:onbuild
Does anybody know what I'm doing wrong here? Any help would be appreciated.
Replicated the steps from GitHub and got the same error. It is throwing Failed to get external IP address... error because the External IP address of the VM is disabled by default in the source code. I was able to build it successfully by adding '--create-external-ip', 'true' in cloudbuild.yaml.
Here is my cloudbuild.yaml:
steps:
- name: 'gcr.io/$PROJECT_ID/windows-builder'
args: [ '--create-external-ip', 'true',
'--command', 'powershell.exe -file build.ps1' ]

Cannot connect to TPU with ssh on GCP

I was following the tutorial on https://cloud.google.com/tpu/docs/how-to.
I created a TPU instance, and tried to connect to it with gcloud compute ssh line. Then, this error occurred.
AppData\Local\Google\Cloud SDK>gcloud compute ssh node-1 --zone=asia-east1-c
PythonERROR: (gcloud.compute.ssh) Could not fetch resource:
- The resource 'projects/project-masker/zones/asia-east1-c/instances/node-1' was not found
Trying to solve this error, I found out that the tpus were not included in the execution group.
AppData\Local\Google\Cloud SDK>gcloud compute tpus list
PythonNAME ZONE ACCELERATOR_TYPE NETWORK RANGE STATUS
node-2 asia-east1-c v2-8 default 10.75.202.248/29 READY
node-1 asia-east1-c v2-8 default 10.82.81.168/29 READY
AppData\Local\Google\Cloud SDK>gcloud compute tpus execution-groups list
PythonListed 0 items.
This is what I got when I tried to restart the tpu.
PythonRequest issued for: [node-1]
Waiting for operation [projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-
e14800b7-d997be6b] to complete...done.
done: true
metadata:
'#type': type.googleapis.com/google.cloud.common.OperationMetadata
apiVersion: v1
cancelRequested: false
createTime: '2021-07-03T08:00:49.884674545Z'
endTime: '2021-07-03T08:01:31.161199334Z'
target: projects/project-masker/locations/asia-east1-c/nodes/node-1
verb: update
name: projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b
response:
'#type': type.googleapis.com/google.cloud.tpu.v1.Node
acceleratorType: v2-8
apiVersion: V1
cidrBlock: 10.82.81.168/29
createTime: '2021-07-03T07:27:41.148997156Z'
health: HEALTHY
ipAddress: 10.82.81.170
name: projects/project-masker/locations/asia-east1-c/nodes/node-1
network: global/networks/default
networkEndpoints:
- ipAddress: 10.82.81.170
port: 8470
port: '8470'
schedulingConfig: {}
serviceAccount: service-...#cloud-tpu.iam.gserviceaccount.com
state: READY
tensorflowVersion: pytorch-1.9
I tried to find some related articles on google, but I couldn't find any. How can I fix this?
You can't SSH to a TPU node directly, so gcloud compute ssh {tpu_name} isn't expected to work.
You can, however, SSH directly to a TPU VM, please see this link. If you are already using TPU VM, then your issue is that you're trying
gcloud compute ssh
rather than
gcloud alpha compute tpus tpu-vm ssh ...

GCE VM shuts down after having Shield VM enabled and will not boot

We have recently enabled Shield VM on a node. The node restarted fine afterwards, however has since shut down and won't boot. We see any boot sequence in the logs as shutting down due to shield VM:
{
insertId: "3"
jsonPayload: {
#type: "type.googleapis.com/cloud_integrity.IntegrityEvent"
bootCounter: "4"
shutdownEvent: {
}
}
logName: "projects/project_name/logs/compute.googleapis.com%2Fshielded_vm_integrity"
receiveTimestamp: "2021-03-26T09:30:41.307564027Z"
resource: {
labels: {…}
type: "gce_instance"
}
severity: "NOTICE"
timestamp: "2021-03-26T09:30:39.300465584Z"
}
However there appears no integrity violation. We have tried disabling Shield VM but still encounter the error. The event prior to shutdown is a late boot event:
insertId: "2"
jsonPayload: {
lateBootReportEvent: {
actualMeasurements: [4]
policyEvaluationPassed: true
Is there any way to bypass the Shield VM checks embedded in the boot sequence to get the node up?
I would suggest to disable all the shielded-vm tags applied to the node, including the integrity-monitoring, through the Cloud Shell or the Shielded VM tab of the VM, then try to boot the node again. You can make use of the command:
gcloud beta compute instances update YOUR_NODE_NAME --no-shielded-integrity-monitoring --no-shielded-secure-boot] --no-shielded-vtpm
You can find more info about the command and the tags here.

Scaling Up of GlusterFS-storage only add new peer without new bricks in Openshift

Observed behavior
I started with one node Openshift cluster and it successfully deployed master/node and gluster volume. Now I extend Openshift cluster and it was successfully.
but on extending glusterfs volume with below
[glusterfs]
10.1.1.1 glusterfs_devices='[ "/dev/vdb" ]'
10.1.1.2 glusterfs_devices='[ "/dev/vdb" ]' openshift_node_labels="type=upgrade"
ansible-playbook -i inventory2.ini /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml -e openshift_upgrade_nodes_label="type=upgrade"
it only added 10.1.1.2 as peer but volume still has only one brick
Following customization done to start deploy gluster from 1 node {--durability none}
openshift-ansible/roles/openshift_storage_glusterfs/tasks/heketi_init_db.yml
- name: Create heketi DB volume
command: "{{ glusterfs_heketi_client }} setup-openshift-heketi-storage --image {{ glusterfs_heketi_image }} --listfile /tmp/heketi-storage.json **--durability none**"
register: setup_storage
>gluster peer status
Number of Peers: 1
Hostname: 10.1.1.2
Uuid: 1b8159e4-99e2-4f4d-ad95-e97bc8655d32
State: Peer in Cluster (Connected)
gluster volume info
Volume Name: heketidbstorage
Type: Distribute
Volume ID: 769419b9-d28f-4cdd-a8f3-708b6b738f65
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.1.1.1:/var/lib/heketi/mounts/vg_4187bfa3eb090ceffea9c53b156ddbd4/brick_80401b43be8c3c8a74417b18ad574524/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
Expected/desired behavior
I am expecting that on addition of every new node it should create new brick too
Details on how to reproduce (minimal and precise)
Add nodes in gluster cluster with below commands
ansible-playbook -i inventory2.ini /usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/config.yml -e openshift_upgrade_nodes_label="type=upgrade"
Information about the environment:
Heketi version used (e.g. v6.0.0 or master): OpenShift 3.10
Operating system used: CentOS
Heketi compiled from sources, as a package (rpm/deb), or container: Container
If container, which container image: docker.io/heketi/heketi:latest
Using kubernetes, openshift, or direct install: Openshift
If kubernetes/openshift, is gluster running inside kubernetes/openshift or outside: outside
If kubernetes/openshift, how was it deployed (gk-deploy, openshift-ansible, other, custom): openshift-ansible
Just adding a node/server does not mean that the brick will also be added to existing
gluster volume.
You have to add that brick, hosted on new node, to existing volume.
command -
"gluster volume add-brick host:brick-path commit force"
Not sure if you have provided this command in your automation script or not.

Openshift: Error pulling image from remote, secure docker registry using certificates

I use the all-in-one VM of Openshift origin.
I am trying to pull images from a private, secure registry using an Image Stream. This is the ImageStream definition:
apiVersion: v1
kind: ImageStream
metadata:
name: my-image-stream
annotations:
description: Keeps track of changes in the application image
name: my-image
spec:
dockerImageRepository: "my.registry.net/myproject/my-image"
The repository is secured with a certificate. On my local machine, i have them in /etc/docker/certs.d/my.registry.net and I can login with docker login my.registry.net.
When I run oc import-image, however, I get the following error:
The import completed with errors.
Name: my-image
Namespace: myproject
Created: About an hour ago
Labels: <none>
Description: Keeps track of changes in the application image
Annotations: openshift.io/image.dockerRepositoryCheck=2017-01-27T08:09:49Z
Docker Pull Spec: 172.30.53.244:5000/myproject/my-image
Unique Images: 0
Tags: 1
latest
tagged from my.registry.net/myproject/my-image
! error: Import failed (InternalError): Internal error occurred: Get https://my.registry.net/v2/: remote error: handshake failure
About an hour ago
I have copied the certificates to the vagrant machine and restarted the docker daemon, but the problem remains. I have not found any documentation on how to properly add the certificates, so I just put them in the usual docker folder.
What is the appropriate way to make this work?
Update in response to rezie's answer:
There is no file etc/origin/master/ca-bundle.crt on my vagrant box. I found the following ca-bundle.crt files :
$ find / -iname ca-bundle.crt
/etc/pki/tls/certs/ca-bundle.crt
##multiple lines like
/var/lib/docker/devicemapper/mnt/something-hash-like/rootfs/etc/pki/tls/certs/ca-bundle.crt
/var/lib/origin/openshift.local.config/master/ca-bundle.crt
I appended the root certificate to /etc/pki/tls/certs/ca-bundle.crt and to var/lib/origin/openshift.local.config/master/ca-bundle.crt, but that did not change anything.
Please note, however, that I do not need to have this root certificate in /etc/docker/certs.d/... in order to login directly using docker login my.registry.net
I have appended
I cannot comment due tow lo karma so I'll write an answer saying almost the same as rezie.
The error:
! error: Import failed (InternalError): Internal error occurred: Get https://my.registry.net/v2/: remote error: handshake failure
About an hour ago
Comes from OpenShift, not from docker, therefore adding it to /etc/docker/certs.d/my.registry.net doesn't prevent the error from happening.
You should add the CA certificate at OS level, my guess is the steps failed for some reason so do it this way:
openssl s_client -connect my.registry.net:443 </dev/null |
sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' \
> /etc/pki/ca-trust/source/anchors/my.registry.net.crt &&
update-ca-trust check && update-ca-trust extract
Finally test if it worked running
curl https://my.registry.net/v2
If it doesn't give you a certificate error and you still can't do the oc import restart the atomic-openshift-master-api service
Try appending your CA (the same one you said you said that was used in the my.registry.net directory) into Openshift's ca bundle (e.g. /etc/origin/master/ca-bundle.crt. Then restart the service and reattempt import-image (making sure that you do not include the --insecure flag).
For reference, check out this issue from the Origin project. As you've mentioned, there's currently no way to supply certificates along with the dockercfg secret, and the suggestion from that issue is to add the CA as a trusted root CA across all the hosts.