GCE Instance Not Found - google-compute-engine

I'm trying to set up a Kubernetes cluster on GCE using CoreOS as the base OS, but I run into the following issue when trying to make the cluster multizone by setting the --cloud-provider and --cloud-config flags.
Below is the output from the API Server on the master node:
Jun 15 09:22:09 cos-000-pub-pvt-master.c.project-id.internal kubelet-wrapper[1098]: E0615 09:22:09.790068 1098 gce.go:2380] Failed to retrieve instance: "10.0.0.2"
Jun 15 09:22:09 cos-000-pub-pvt-master.c.project-id.internal kubelet-wrapper[1098]: E0615 09:22:09.790125 1098 gce.go:2414] getInstanceByName/multiple-zones: failed to get instance 10.0.0.2; err: instance not found
Jun 15 09:22:09 cos-000-pub-pvt-master.c.project-id.internal kubelet-wrapper[1098]: E0615 09:22:09.790151 1098 kubelet.go:1131] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: instance not found
When running kubectl get nodes there is no output, but when running kubectl --namespace kube-system get pods I see the API Server, Controller Manager, Scheduler and a Proxy for each of the nodes. Although I can see them, they are restarted every 45-60 seconds.
The GCE config file is as follows:
[GLOBAL]
multizone=true
If I've left out anything that would help, let me know.

It seems that the --hostname-override flag was causing this issue. I've removed that and the master is now able to find the node in the GCE API.
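The log lines above show the kubelet asking GCE for an instance named "10.0.0.2", which is consistent with the override having been set to an IP address rather than an instance name. As a rough sketch (the helper names looks_like_ipv4 and check_node_name are made up for illustration and are not part of any Kubernetes or GCE tooling), a candidate node name can be sanity-checked before it is passed to the kubelet:

```shell
# Hypothetical helper: succeeds when the argument looks like a dotted IPv4 address.
# It only checks the shape of the string, not that each octet is <= 255.
looks_like_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Hypothetical helper: warn when a candidate node name is an IP address.
check_node_name() {
  if looks_like_ipv4 "$1"; then
    echo "warning: '$1' looks like an IP address, not an instance name"
  else
    echo "ok: '$1'"
  fi
}
```

For example, check_node_name 10.0.0.2 prints the warning, while check_node_name cos-000-pub-pvt-master prints ok.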

Unable to connect worker node to master using K3S

I am trying to set up a K3S cluster for learning purposes, but I am having trouble connecting the master node with agents. I have looked at several tutorials and discussions on this, but I can't find a solution. I know I am probably missing something obvious (due to my lack of knowledge), but help would still be much appreciated.
I am using two AWS t2.micro instances with default configuration.
I SSHed into the master and installed K3S using
curl -sfL https://get.k3s.io | sh -s - --no-deploy traefik --write-kubeconfig-mode 644 --node-name k3s-master-01
With kubectl get nodes, I am able to see the master:
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 13s v1.23.6+k3s1
So far it seems I am doing things right. From what I understand, I am supposed to configure the kubeconfig file. So, I accessed it by using
cat /etc/rancher/k3s/k3s.yaml
I copied the configuration file and changed the server entry to match the private IP I took from the AWS console, resulting in something like this:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <lots_of_info>
    server: https://<master_private_IP>:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    client-certificate-data: <my_certificate_data>
    client-key-data: <my_key_data>
Then, I ran vi ~/.kube/config and pasted the kubeconfig file there.
Finally, I grabbed the token with cat /var/lib/rancher/k3s/server/node-token, SSHed into the other machine and ran the following:
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-01 K3S_URL=https://<master_private_IP>:6443 K3S_TOKEN=<master_token> sh -
The output is
[INFO] Finding release for channel stable
[INFO] Using v1.23.6+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/sha256sum-amd64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/k3s
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s-agent.service
[INFO] systemd: Enabling k3s-agent unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s-agent.service → /etc/systemd/system/k3s-agent.service.
[INFO] systemd: Starting k3s-agent
From this output, it looks like I have created an agent. However, when I run kubectl get nodes on the master, I still get
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 12m v1.23.6+k3s1
What was I supposed to do to get the agent connected to the master? I guess I am missing something simple, but I just can't seem to find the solution. I've read all the documentation, but it is still not clear to me where I am making the mistake. I've also tried saving the master's private IP and token on the agent as environment variables with export K3S_TOKEN=master_token and K3S_URL=master_private_IP and then simply running curl -sfL https://get.k3s.io | sh -, but I still can't see the worker nodes when running kubectl get nodes.
Any help would be appreciated.
It might be your VM instance firewall that prevents a proper connection from your master to the worker node (and vice versa). The official Rancher documentation advises disabling the firewall on (Red Hat/CentOS) Enterprise Linux:
It is recommended to turn off firewalld:
systemctl disable firewalld --now
If enabled, it is required to disable nm-cloud-setup and reboot the node:
systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
reboot
If you are using Ubuntu on your VMs, there is a different firewall tool (ufw).
In my case, allowing TCP connections on ports 6443 and 443 (not sure if the latter is required) worked fine.
Allow TCP connections on port 6443 on all of your cluster machines:
sudo ufw allow 6443/tcp
Then run the k3s installation script on your worker node(s):
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-1 K3S_URL=https://<k3s-master-1 IP>:6443 K3S_TOKEN=<k3s-master-1 TOKEN> sh -
This should work. If not, you can try adding an allow rule for TCP port 443 as well.
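Before re-running the installer, it may help to confirm from the worker that the master's API port is reachable at all. A minimal sketch, assuming bash and the coreutils timeout command are available on the worker; port_open is an illustrative helper name, not part of k3s:

```shell
# Hypothetical helper: returns 0 if a TCP connection to host:port succeeds
# within 3 seconds, non-zero otherwise.
# Uses bash's built-in /dev/tcp redirection, so no extra tools such as nc are needed.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage on the worker (replace the placeholder with the master's private IP):
#   port_open <master_private_IP> 6443 && echo reachable || echo blocked
```

If this reports the port as blocked, the firewall on the master (or the cloud security group in front of it) is the first thing to look at.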
A few options to check.
Check journalctl for errors:
journalctl -u k3s-agent.service -n 300 -xn
If using a Raspberry Pi as a worker node, make sure you have
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
at the very end of your /boot/cmdline.txt file. DO NOT PUT THIS VALUE ON A NEW LINE! It should just be appended to the end of the existing line.
If your master node(s) have self-signed certs, make sure you copy the master node's self-signed cert to your worker node(s). On Linux or Raspberry Pi, copy the cert to /usr/local/share/ca-certificates, then run
sudo update-ca-certificates
on the worker node
Don't forget to reboot the worker node after you make these changes!
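For the Raspberry Pi step, the flags have to end up on the same line as the existing kernel parameters. A sketch of one way to do that with sed, assuming a single-line cmdline.txt; append_cgroup_flags is an illustrative name, and the function keeps a .bak copy of the original file:

```shell
# Hypothetical helper: append the cgroup flags to the end of the first line
# of the given file, without adding a new line.
append_cgroup_flags() {
  file=$1
  flags='cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1'

  # Do nothing if the flags were already added.
  if grep -q 'cgroup_memory=1' "$file"; then
    return 0
  fi

  # Replace any trailing whitespace on line 1 with a single space plus the flags.
  # -i.bak edits in place and leaves the original as <file>.bak.
  sed -i.bak "1 s/[[:space:]]*\$/ $flags/" "$file"
}

# Usage (as root): append_cgroup_flags /boot/cmdline.txt
```

Running it a second time leaves the file unchanged, so it is safe to repeat.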
Hope this helps someone!

GCP deployment fails on "Updating service"

I have an ASP.NET Core application hosted on GCP App Engine. When I try to deploy the application, it fails on the last step:
Updating service [name] (this may take several minutes)... ...failed
ERROR: (gcloud.app.deploy) Error Response: [9] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>blablabla.wm.1
The exception stack trace shows that a service running in the background couldn't find a MySQL table (that table obviously exists).
My app.yaml file:
service: XXX
runtime: custom
env: flex
automatic_scaling:
  max_concurrent_requests: 80
  min_num_instances: 1
  max_num_instances: 1
resources:
  cpu: XXX
  memory_gb: XXX
beta_settings:
  cloud_sql_instances: "XXX:XXXX:XXXX=tcp:3306"
It looks like the application is deployed properly despite the error. This is the only error, and the background service doesn't throw any exceptions at a later point. In fact, it works properly and can connect to the database.
My guess was that maybe GCP is checking health while the application is not yet connected to the database. So I tried adding liveness_check and readiness_check to app.yaml and configured a dedicated /healthcheck endpoint in my application, but it didn't make any difference.
Any ideas how to fix it and what might be the cause?
Deploying the app with a new version fixed the issue.

How to view Routes pod in OpenShift

I have created a route for my service in OpenShift:
oc get routes
NAME              HOST/PORT                              PATH   SERVICES          PORT
simplewebserver   simpleweb.apps.devcluster.os.fly.com          simplewebserver   9999
When I ran command: curl http://simpleweb.apps.devcluster.os.fly.com/world
it failed to reach my web service. I suspect my route has some problem, but I could not see any route debug information.
My question is: how do I find the route pod in OpenShift, or how do I find route activity information when I access the route?
You can check the router logs in the logs container of the router pods. In our OCP cluster, I could see the router pods in the openshift-ingress namespace.
oc get pods -n openshift-ingress
NAME READY STATUS RESTARTS AGE
router-default-5f9c4b6cb4-12121a 2/2 Running 0 40h
router-default-5f9c4b6cb4-12133a 2/2 Running 0 40h
To get the logs, use the command below:
oc -n openshift-ingress logs -f <router_pod_name> -c logs
Also make sure HAProxy logs are enabled so you can find out which URLs are being hit via the router.
https://access.redhat.com/solutions/3397701
As there is limited information about your problem, here are a few things you can try.
Try to curl using the port:
curl -kv http://simpleweb.apps.devcluster.os.fly.com:9999
Access the logs of the pod for which the route was created. Check that the service simplewebserver is using the correct selector to route traffic to the pod.
Run oc describe service simplewebserver to see the selectors being used.
Check whether any network policy is blocking external traffic.
Check whether you can access the target pod through that service from within the same namespace. You can do that by opening a remote shell (oc rsh) in a pod and then accessing the service using:
curl -kv http://servicename.projectname.svc.cluster.local
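The selector check above can be approximated by looking at the service's endpoints: if the selector matches no running pods, the endpoints list is empty. A small sketch, assuming oc is logged in and pointed at the right project; service_has_endpoints is a made-up helper name:

```shell
# Hypothetical helper: prints the pod IPs behind a service, or fails if there are none.
service_has_endpoints() {
  svc=$1
  # Collect the IPs of all ready addresses listed in the Endpoints object.
  ips=$(oc get endpoints "$svc" -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -n "$ips" ]; then
    echo "$svc -> $ips"
  else
    echo "$svc has no ready endpoints; compare its selector with the pod labels" >&2
    return 1
  fi
}

# Usage: service_has_endpoints simplewebserver
```

An empty result usually points at the service selector or at pods that are not ready, rather than at the route itself.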

route to application stopped working in OpenShift Online 3.9

I have an application running in OpenShift Online Starter, which worked for the last 5 months: a single pod behind a service, with a route defined that does edge TLS termination.
Since Saturday, when trying to access the application, I get the error message
Application is not available
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.
Possible reasons you are seeing this page:
The host doesn't exist. Make sure the hostname was typed correctly and that a route matching this hostname exists.
The host exists, but doesn't have a matching path. Check if the URL path was typed correctly and that the route was created using the desired path.
Route and path matches, but all pods are down. Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
The pod is running, I can exec into it and check this, I can port-forward to it and access it.
Checking the different components with oc:
$ oc get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE
taboo3-23-jt8l8 1/1 Running 0 1h 10.128.37.90 ip-172-31-30-113.ca-central-1.compute.internal
$ oc get svc
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
taboo3 172.30.238.44 <none> 8080/TCP 151d
$ oc describe svc taboo3
Name: taboo3
Namespace: sothawo
Labels: app=taboo3
Annotations: openshift.io/generated-by=OpenShiftWebConsole
Selector: deploymentconfig=taboo3
Type: ClusterIP
IP: 172.30.238.44
Port: 8080-tcp 8080/TCP
Endpoints: 10.128.37.90:8080
Session Affinity: None
Events: <none>
$ oc get route
NAME     HOST/PORT                                                    PATH   SERVICES   PORT       TERMINATION     WILDCARD
taboo3   taboo3-sothawo.193b.starter-ca-central-1.openshiftapps.com          taboo3     8080-tcp   edge/Redirect   None
I tried to add a new route as well (with or without tls), but am getting the same error.
Does anybody have an idea what might be causing this and how to fix it?
Addition April 17, 2018: Got an email from Openshift Online support:
It looks like you may be affected by this bug.
So waiting for it to be resolved.
The problem has been resolved by OpenShift Online; the application is working again.

Openshift 3 , 503 Error (No server is available to handle this request)

I have created a web application using JSP/Tiles/Struts/MySQL/Tomcat. I created a new project on the OpenShift 3 console (OpenShift Online) https://console.preview.openshift.com/console/ and then added Tomcat and MySQL. I sometimes get a 503 error, while at other times the same page works as expected. The 503 error comes up randomly for any page in my project. When I get it, I refresh a number of times and it goes away, and my page is displayed correctly.
Error that I see is:
"503 Service Unavailable
No server is available to handle this request. "
I did some research:
What I understand from this openshift 2 link:
https://blog.openshift.com/how-to-host-your-java-ee-application-with-auto-scaling/
is that to correct 503 error:
SSH into your application gear using rhc ssh --app <app_name>
Change directory to haproxy/conf
In haproxy.cfg, change option httpchk GET / to option httpchk GET /api/v1/ping
Restart the HAProxy cartridge from your local machine using RHC rhc cartridge-restart --cartridge haproxy
I don't know if this is also applicable to OpenShift 3. Where are haproxy.log, haproxy.cfg and haproxy/conf in OpenShift 3, or is it slightly different there? (But thanks to Warren's comments: yes, he saw a 503 error in OpenShift related to HAProxy.)
Now, 1 week after posting this question:
I am getting a Quota Reached error. I am able to build my project, but all deployments are failing. I wonder if the 503 error that I was getting earlier was (either completely or partially) related to the quota being reached. How should I proceed now?
curl -i localhost:8080/GEA
HTTP/1.1 302 Found
Server: Apache-Coyote/1.1
Location: http://localhost:8080/GEA/
Transfer-Encoding: chunked
Date: Tue, 11 Apr 2017 18:03:25 GMT
Tomcat logs do not show any application error.
Will a readiness probe and a liveness probe help me? I have not set them yet, nor do I know how to set them.
Will scaling help me? (I don't know how to set that up either.)
Do I have to set memory/... all to the maximum allowed to ensure the project runs smoothly?
I had a similar situation of sometimes getting 503s and sometimes getting my actual page. The reason is that HAProxy sits on the front end handling the requests. Depending on your setup, you may even have a few HAProxy pods, and your request could be routed to any one of them. In my case, one pod was working and another was not.
So basically
oc get pods -n default
NAME READY STATUS RESTARTS AGE
docker-registry-7-i02rh 1/1 Running 0 75d
registry-console-12-wciib 1/1 Running 0 67d
router-1-533cg 1/1 Running 3 76d
router-1-9utld 1/1 Running 1 76d
router-1-uwf64 1/1 Running 1 76d
As you can see in my output, the default namespace is where my router (HAProxy) pods live. If I change to that namespace
oc project default
Then run
oc logs -f router-1-533cg
on each of the pods, you will most likely find a specific pod that is behaving badly. You can simply delete it, and the replication controller will create a new one.
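To compare the router pods without switching between them by hand, the log check can be wrapped in a loop. A sketch, assuming the router pods live in the default namespace and have names starting with router- as in the output above; tail_router_logs is an illustrative name:

```shell
# Hypothetical helper: print the last N log lines of every router pod in a namespace.
tail_router_logs() {
  ns=${1:-default}
  lines=${2:-20}
  # "-o name" prints entries like pod/router-1-533cg; keep only the router pods.
  oc get pods -n "$ns" -o name | grep '/router-' | while read -r pod; do
    echo "== $pod =="
    oc logs -n "$ns" --tail="$lines" "$pod" </dev/null
  done
}

# Usage: tail_router_logs default 50
```

A pod whose log looks different from the others is the candidate to delete with oc delete pod <pod_name> -n default.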