Cannot connect to TPU with ssh on GCP - google-compute-engine

I was following the tutorial on https://cloud.google.com/tpu/docs/how-to.
I created a TPU instance and tried to connect to it with gcloud compute ssh, but this error occurred:
AppData\Local\Google\Cloud SDK>gcloud compute ssh node-1 --zone=asia-east1-c
ERROR: (gcloud.compute.ssh) Could not fetch resource:
- The resource 'projects/project-masker/zones/asia-east1-c/instances/node-1' was not found
While trying to solve this error, I found that the TPUs were not included in any execution group:
AppData\Local\Google\Cloud SDK>gcloud compute tpus list
NAME    ZONE          ACCELERATOR_TYPE  NETWORK  RANGE             STATUS
node-2  asia-east1-c  v2-8              default  10.75.202.248/29  READY
node-1  asia-east1-c  v2-8              default  10.82.81.168/29   READY
AppData\Local\Google\Cloud SDK>gcloud compute tpus execution-groups list
Listed 0 items.
This is what I got when I tried to restart the tpu.
Request issued for: [node-1]
Waiting for operation [projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b] to complete...done.
done: true
metadata:
  '@type': type.googleapis.com/google.cloud.common.OperationMetadata
  apiVersion: v1
  cancelRequested: false
  createTime: '2021-07-03T08:00:49.884674545Z'
  endTime: '2021-07-03T08:01:31.161199334Z'
  target: projects/project-masker/locations/asia-east1-c/nodes/node-1
  verb: update
name: projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b
response:
  '@type': type.googleapis.com/google.cloud.tpu.v1.Node
  acceleratorType: v2-8
  apiVersion: V1
  cidrBlock: 10.82.81.168/29
  createTime: '2021-07-03T07:27:41.148997156Z'
  health: HEALTHY
  ipAddress: 10.82.81.170
  name: projects/project-masker/locations/asia-east1-c/nodes/node-1
  network: global/networks/default
  networkEndpoints:
  - ipAddress: 10.82.81.170
    port: 8470
  port: '8470'
  schedulingConfig: {}
  serviceAccount: service-...@cloud-tpu.iam.gserviceaccount.com
  state: READY
  tensorflowVersion: pytorch-1.9
I tried to find some related articles on google, but I couldn't find any. How can I fix this?

You can't SSH to a TPU node directly, so gcloud compute ssh {tpu_name} isn't expected to work.
You can, however, SSH directly to a TPU VM; see the Cloud TPU documentation on TPU VMs. If you are already using a TPU VM, then the problem is that you're running
gcloud compute ssh
rather than
gcloud alpha compute tpus tpu-vm ssh ...

Setting vm.max_map_count with tuned

I am trying to set vm.max_map_count with the Node Tuning Operator and the OpenShift ClusterLogging operator. The OpenShift version is 4.9.17, and the cluster logging and Elasticsearch operators are the latest versions.
This is my tuned configuration:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: common-services-es
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Optimize systems running ES on OpenShift nodes
      include=openshift-node
      [sysctl]
      vm.max_map_count=262144
    name: common-services-es
  recommend:
  - match:
    - label: component
      type: pod
      value: elasticsearch
    priority: 5
    profile: common-services-es
My ClusterLogging operator configuration is the default one, and I can verify that the pod carries the label component=elasticsearch.
Getting the pod logs with the following command
for p in `oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned -o=jsonpath='{range .items[*]}{.metadata.name} {end}'`; do printf "\n*** $p ***\n" ; oc logs pod/$p -n openshift-cluster-node-tuning-operator | grep applied; done
returns tuned.daemon.daemon: static tuning from profile 'common-services-es' applied on all 3 of my ES nodes. However, the Elasticsearch pod still fails to start with the error max virtual memory areas vm.max_map_count [253832] is too low, increase to at least [262144], and running sysctl vm.max_map_count on the nodes confirms that the value is 253832.
It turns out that IBM Cloud OpenShift doesn't use MachineConfigs, and tuned uses MachineConfigs.
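Because the tuned log reported the profile as applied while the node value stayed unchanged, it can help to check the effective value on the node itself rather than trusting the log. Below is a minimal Python sketch of that check. It assumes a Linux node where the setting is exposed at /proc/sys/vm/max_map_count (the same value sysctl vm.max_map_count prints); the function names are made up for illustration, and the threshold is the one quoted in the Elasticsearch error above.

```python
from pathlib import Path

# Minimum quoted in the Elasticsearch error message above.
REQUIRED_MAX_MAP_COUNT = 262144


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the kernel's current vm.max_map_count (Linux only)."""
    return int(Path(path).read_text().strip())


def check_max_map_count(current, required=REQUIRED_MAX_MAP_COUNT):
    """Return a one-line verdict comparing the current value to the minimum."""
    if current >= required:
        return "OK: vm.max_map_count={}".format(current)
    return "TOO LOW: vm.max_map_count={}, need at least {}".format(current, required)


if __name__ == "__main__":
    print(check_max_map_count(read_max_map_count()))
```

Run on each Elasticsearch node (for example from a debug shell), this shows whether the sysctl actually took effect, independently of what the tuned daemon logged.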

Google Cloud Build windows builder error "Failed to get external IP address: Could not get external NAT IP from list"

I am trying to implement automatic deployments for my Windows Kubernetes container app. I'm following the instructions from Google's windows-builder, but the trigger fails with this error about 1.5 minutes in:
2021/12/16 19:30:06 Set ingress firewall rule successfully
2021/12/16 19:30:06 Failed to get external IP address: Could not get external NAT IP from list
ERROR
ERROR: build step 0 "gcr.io/[my-project-id]/windows-builder" failed: step exited with non-zero status: 1
The container, gcr.io/[my-project-id]/windows-builder, definitely exists, and it is located in the same GCP project as the Cloud Build trigger, as the windows-builder documentation requires.
I structured my code based on Google's docker-windows example. Here is my repository file structure:
repository
  cloudbuild.yaml
  builder.ps1
  worker
    Dockerfile
Here is my cloudbuild.yaml:
steps:
# WORKER
- name: 'gcr.io/[my-project-id]/windows-builder'
  args: [ '--command', 'powershell.exe -file build.ps1' ]
# OPTIONS
options:
  logging: CLOUD_LOGGING_ONLY
Here is my builder.ps1:
docker build -t gcr.io/[my-project-id]/test-worker ./worker;
if ($?) {
  docker push gcr.io/[my-project-id]/test-worker;
}
Here is my Dockerfile:
FROM gcr.io/[my-project-id]/test-windows-node-base:onbuild
Does anybody know what I'm doing wrong here? Any help would be appreciated.
I replicated the steps from GitHub and got the same error. The build throws the Failed to get external IP address... error because the external IP address of the VM is disabled by default in the source code. I was able to build successfully by adding '--create-external-ip', 'true' to cloudbuild.yaml.
Here is my cloudbuild.yaml:
steps:
- name: 'gcr.io/$PROJECT_ID/windows-builder'
  args: [ '--create-external-ip', 'true',
          '--command', 'powershell.exe -file build.ps1' ]

route to application stopped working in OpenShift Online 3.9

I have an application running in OpenShift Online Starter that has worked for the last 5 months: a single pod behind a service, with a route defined that does edge TLS termination.
Since Saturday, when trying to access the application, I get the error message
Application is not available
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.
Possible reasons you are seeing this page:
The host doesn't exist. Make sure the hostname was typed correctly and that a route matching this hostname exists.
The host exists, but doesn't have a matching path. Check if the URL path was typed correctly and that the route was created using the desired path.
Route and path matches, but all pods are down. Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
The pod is running: I can exec into it to check, and I can port-forward to it and access the application.
Checking the different components with oc:
$ oc get po -o wide
NAME              READY   STATUS    RESTARTS   AGE   IP             NODE
taboo3-23-jt8l8   1/1     Running   0          1h    10.128.37.90   ip-172-31-30-113.ca-central-1.compute.internal
$ oc get svc
NAME     CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
taboo3   172.30.238.44   <none>        8080/TCP   151d
$ oc describe svc taboo3
Name:              taboo3
Namespace:         sothawo
Labels:            app=taboo3
Annotations:       openshift.io/generated-by=OpenShiftWebConsole
Selector:          deploymentconfig=taboo3
Type:              ClusterIP
IP:                172.30.238.44
Port:              8080-tcp 8080/TCP
Endpoints:         10.128.37.90:8080
Session Affinity:  None
Events:            <none>
$ oc get route
NAME     HOST/PORT                                                    PATH   SERVICES   PORT       TERMINATION     WILDCARD
taboo3   taboo3-sothawo.193b.starter-ca-central-1.openshiftapps.com          taboo3     8080-tcp   edge/Redirect   None
I also tried adding a new route (with and without TLS), but I get the same error.
Does anybody have an idea what might be causing this and how to fix it?
Addition April 17, 2018: I got an email from OpenShift Online support:
It looks like you may be affected by this bug.
So I am waiting for it to be resolved.
The problem has been resolved by OpenShift Online; the application is working again.

GCE - No stackdriver memory metrics for nodes

I have set up my Kubernetes 1.3.4 cluster on GCE with
export KUBE_ENABLE_CLUSTER_MONITORING=google
This works quite nicely: I get application logs (for some reason in the Container Engine section, but well) and also pod and node metrics.
The only thing missing is the node memory metrics; only CPU is shown (see screenshot).
[Screenshot: no memory metrics]
In the heapster logs I see tons of lines like this:
{
  metadata: {
    severity: "ERROR"
    projectId: "<project-id>"
    serviceName: "container.googleapis.com"
    zone: "europe-west1-d"
    labels: {
      container.googleapis.com/cluster_name: "production"
      compute.googleapis.com/resource_type: "instance"
      compute.googleapis.com/resource_name: "fluentd-cloud-logging-production-minion-group-p0w8"
      container.googleapis.com/instance_id: "6772154497331326454"
      container.googleapis.com/pod_name: "heapster-v1.1.0-2102007506-23b3e"
      compute.googleapis.com/resource_id: "6772154497331326454"
      container.googleapis.com/stream: "stderr"
      container.googleapis.com/namespace_name: "kube-system"
      container.googleapis.com/container_name: "heapster"
    }
    timestamp: "2016-09-13T14:40:08.000Z"
    projectNumber: "930564692351"
  }
  textPayload: "E0913 14:40:08.665035 1 gcm.go:179] Error while sending request to GCM googleapi: Error 400: Timeseries 76, point: start is not older than end, for a cumulative metric, invalidParameter"
  insertId: "pt5bo7g132r266"
  log: "heapster"
}
Not sure if this is related.
Any ideas?
If you are running your cluster on GCE instead of GKE, you should install the Stackdriver agent and verify the credentials that the agent uses to communicate with Stackdriver.
If you are using Linux, you can install the agent by executing:
curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh
and you can check your credentials by running the following commands:
sudo cat $GOOGLE_APPLICATION_CREDENTIALS
sudo cat /etc/google/auth/application_default_credentials.json
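The two cat commands print the whole credentials file, private key included. As a gentler alternative, here is a minimal Python sketch that looks in the same two places and reports only non-secret fields. It is an illustration under assumptions: the lookup order simply mirrors the order of the commands above, the type and client_email keys are the fields a standard service account key file carries, and the function names are hypothetical.

```python
import json
import os
from pathlib import Path

# Second location checked by the commands above.
DEFAULT_CREDENTIALS_PATH = "/etc/google/auth/application_default_credentials.json"


def find_credentials(env=None, default_path=DEFAULT_CREDENTIALS_PATH):
    """Return the first credentials file that exists, or None if neither does."""
    env = os.environ if env is None else env
    candidates = [env.get("GOOGLE_APPLICATION_CREDENTIALS"), default_path]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None


def describe_credentials(path):
    """Summarize a credentials JSON file without exposing the private key."""
    data = json.loads(Path(path).read_text())
    return {"type": data.get("type"), "client_email": data.get("client_email")}


if __name__ == "__main__":
    found = find_credentials()
    print(describe_credentials(found) if found else "no credentials file found")
```

Reading the file under /etc/google/auth may require root, just as the sudo in the original commands suggests.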

Google compute global forwarding rule asking for region

I am trying to deploy a global forwarding rule. My YAML file is below:
resources:
- name: rule
  type: compute.v1.forwardingRule
  properties:
    portRange: 80-80
    IPProtocol: TCP
    target: projects/{{ env["project"] }}/global/targetHttpProxies/myproxy
    IPAddress: xx.xx.xx.xx
When I run the command:
gcloud deployment-manager deployments create grule --config test.yaml
it fails with an error saying that the resource property region is required. It asks for a region, but I am trying to create a global forwarding rule, which should not need one.
Maybe it should be compute.v1.globalForwardingRule?
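To make that suggestion concrete, here is a minimal sketch of the same resource written as a Deployment Manager Python template, with only the type swapped to compute.v1.globalForwardingRule. All property values are copied from the question's YAML. Whether the type change alone resolves the region error is the guess made above, not something verified here.

```python
def GenerateConfig(context):
    """Deployment Manager template: one global forwarding rule."""
    project = context.env["project"]
    return {
        "resources": [
            {
                "name": "rule",
                # Global variant of the type; the regional type
                # compute.v1.forwardingRule is the one that asked for a region.
                "type": "compute.v1.globalForwardingRule",
                "properties": {
                    "portRange": "80-80",
                    "IPProtocol": "TCP",
                    "target": "projects/{}/global/targetHttpProxies/myproxy".format(project),
                    # Placeholder kept exactly as in the question.
                    "IPAddress": "xx.xx.xx.xx",
                },
            }
        ]
    }
```

In the existing YAML file, the equivalent change is just replacing the type line and keeping everything else as is.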