Ingress Nginx can't tolerate master taint - kubernetes-ingress

Problem
When trying to install ingress-nginx on a single-node (also master) Kubernetes cluster, the Helm install fails, complaining that the pod can't be scheduled on the master because it can't tolerate the master taint:
- FailedScheduling
- pod/ingress-nginx-admission-create--1-n7bhg
- 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Details
Kubernetes:
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Helm Version:
version.BuildInfo{Version:"v3.7.0", GitCommit:"eeac83883cb4014fe60267ec6373570374ce770b", GitTreeState:"clean", GoVersion:"go1.16.8"}
Installation steps followed : ( from documentation at https://kubernetes.github.io/ingress-nginx/deploy/#using-helm )
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx
Cluster node:
ip-172-29-1-103 Ready control-plane,master 81m v1.22.2 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-172-29-1-103,kubernetes.io/os=linux,mitg.cisco.com/node-type=pats,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
Removing the master node taint doesn't look right for other reasons. What would be a solution?

In general, to get workloads scheduled on the Kubernetes control plane (i.e. master nodes), you need to do the following:
kubectl taint nodes --all node-role.kubernetes.io/master-
or, on 1.7 and above:
kubectl taint node mymasternode node-role.kubernetes.io/master:NoSchedule-
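Note that on newer kubeadm clusters the control plane may carry a node-role.kubernetes.io/control-plane taint as well (your node already has the matching role label); if kubectl describe shows it, remove it the same way:
kubectl taint nodes --all node-role.kubernetes.io/control-plane-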
To find out what the master node is currently tainted with, describe the node and look at the labels and taints associated with it. The commands above untaint the master node and allow workloads to be scheduled to it. Essentially, find the taint on the node in question that is preventing scheduling and remove it. Without any description of your node or of the failing resources, that's the best advice I can give, but it sounds like you didn't remove the taint that prevents scheduling to your master node; by default, workloads are kept off master nodes.
You can also spin up a worker node, join it to your cluster to work around the issue, and see if the pod gets scheduled to the joined worker.
My best advice is to find your taint:
kubectl describe node <insert-node-name-here>
Find the taints and/or tolerations that are preventing scheduling and remove them.
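Alternatively, if you'd rather keep the taint in place, you can give the chart's pods a toleration for it. A minimal sketch of a Helm values file, assuming the ingress-nginx chart exposes controller.tolerations and controller.admissionWebhooks.patch.tolerations (verify against the values.yaml of your chart version):
# values.yaml (value paths assumed; check your chart version)
controller:
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  admissionWebhooks:
    patch:
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
Then install with:
helm install ingress-nginx ingress-nginx/ingress-nginx -f values.yaml
This keeps the control plane protected from other workloads while still letting the ingress controller and its admission job schedule there.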
Read through the following to see if it helps you:
https://kubernetes.io/docs/concepts/architecture/nodes/
https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
NOTE: THIS SHOULD NOT BE DONE ON PRODUCTION CLUSTERS, SO I ASSUME THIS IS A DEVELOPMENT CLUSTER YOU ARE WORKING WITH

Related

Unable to connect worker node to master using K3S

I am trying to set up a K3S cluster for learning purposes, but I am having trouble connecting the master node with agents. I have looked at several tutorials and discussions on this, but I can't find a solution. I know I am probably missing something obvious (due to my lack of knowledge), but still, help would be much appreciated.
I am using two AWS t2.micro instances with default configuration.
I ssh'd into the master and installed K3S using
curl -sfL https://get.k3s.io | sh -s - --no-deploy traefik --write-kubeconfig-mode 644 --node-name k3s-master-01
With kubectl get nodes, I am able to see the master:
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 13s v1.23.6+k3s1
So far, it seems I am doing things right. From what I understand, I am supposed to configure the kubeconfig file, so I accessed it using
cat /etc/rancher/k3s/k3s.yaml
I copied the configuration file and changed the server entry to match the private IP taken from the AWS console, resulting in something like this:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <lots_of_info>
    server: https://<master_private_IP>:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    client-certificate-data: <my_certificate_data>
    client-key-data: <my_key_data>
Then I ran vi ~/.kube/config and pasted the kubeconfig file there.
Finally, I grabbed the token with cat /var/lib/rancher/k3s/server/node-token, ssh'd into the other machine and ran the following:
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-01 K3S_URL=https://<master_private_IP>:6443 K3S_TOKEN=<master_token> sh -
The output is
[INFO] Finding release for channel stable
[INFO] Using v1.23.6+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/sha256sum-amd64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.23.6+k3s1/k3s
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Creating /usr/local/bin/ctr symlink to k3s
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-agent-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s-agent.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s-agent.service
[INFO] systemd: Enabling k3s-agent unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s-agent.service → /etc/systemd/system/k3s-agent.service.
[INFO] systemd: Starting k3s-agent
From this output, it looks like the agent was created. However, when I run kubectl get nodes on the master, I still get
NAME STATUS ROLES AGE VERSION
k3s-master-01 Ready control-plane,master 12m v1.23.6+k3s1
What was I supposed to do to get the agent connected to the master? I guess I am probably missing something simple, but I just can't seem to find the solution. I've read all the documentation, but it is still not clear to me where I am making the mistake. I've also tried saving the master's private IP and token on the agent as environment variables with export K3S_TOKEN=master_token and export K3S_URL=master_private_IP, and then simply running curl -sfL https://get.k3s.io | sh -, but I still can't see the worker node when running kubectl get nodes.
Any help would be appreciated.
It might be your VM instance firewall that prevents appropriate connections between your master and worker node (and vice versa). The official Rancher documentation advises disabling the firewall on (Red Hat/CentOS) Enterprise Linux:
It is recommended to turn off firewalld:
systemctl disable firewalld --now
If enabled, it is required to disable nm-cloud-setup and reboot the node:
systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
reboot
If you are using Ubuntu on your VMs, there is a different firewall tool (ufw).
In my case, allowing TCP connections on ports 6443 and 443 (not sure if the latter is required) worked fine.
Allow TCP connections on port 6443 on all of your cluster machines:
sudo ufw allow 6443/tcp
Then run the k3s installation script on your worker node(s):
curl -sfL https://get.k3s.io | K3S_NODE_NAME=k3s-worker-1 K3S_URL=https://<k3s-master-1 IP>:6443 K3S_TOKEN=<k3s-master-1 TOKEN> sh -
This should work. If not, you can try adding an additional allow rule for TCP port 443 as well.
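Since these are AWS instances, also check that the security group allows inbound TCP 6443 between the nodes; an open ufw doesn't help if the security group blocks the traffic. A quick reachability test from the worker (substitute your master's IP; an HTTP 401/403 response still proves the port is open):
curl -vk https://<master_private_IP>:6443/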
A few options to check.
Check journalctl for errors:
journalctl -u k3s-agent.service -n 300 -xn
If using a Raspberry Pi for a worker node, make sure you have
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
at the very end of your /boot/cmdline.txt file. DO NOT PUT THIS VALUE ON A NEW LINE! It should just be appended to the end of the existing line.
If your master node(s) have self-signed certs, make sure you copy the master node's self-signed cert to your worker node(s). On Linux or Raspberry Pi, copy the cert to /usr/local/share/ca-certificates, then run
sudo update-ca-certificates
on the worker node
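A sketch of that copy for k3s, assuming the default location of the server CA (verify the path on your install):
# run on the worker node; the CA path is the k3s default and may differ on your setup
scp <user>@<master_private_IP>:/var/lib/rancher/k3s/server/tls/server-ca.crt /tmp/k3s-server-ca.crt
sudo cp /tmp/k3s-server-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates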
Don't forget to reboot the worker node after you make these changes!
Hope this helps someone!

How to deploy OpenShift Origin 10 (OKD) on one node with GlusterFS

I am able to install OKD on one node and scale it up to multiple nodes accordingly.
But now I want to install OKD with GlusterFS on one node and then extend this to multiple nodes.
Currently I am getting an error that at least three nodes are required. How can I bypass this check in Ansible?
As per the GitHub documentation, I have three options:
Configuring a new, natively-hosted GlusterFS cluster. In this scenario, GlusterFS pods are deployed on nodes in the OpenShift cluster which are configured to provide storage.
Configuring a new, external GlusterFS cluster. In this scenario, the cluster nodes have the GlusterFS software pre-installed but have not been configured yet. The installer will take care of configuring the cluster(s) for use by OpenShift applications.
Using existing GlusterFS clusters. In this scenario, one or more GlusterFS clusters are assumed to be already setup. These clusters can be either natively-hosted or external, but must be managed by a heketi service.
Can option 2 or 3 be used to start with one node and extend accordingly? I have installed a GlusterFS cluster on one node and extended it to a second node, but how do I introduce it to OpenShift?
https://imranrazakh.blogspot.com/2018/08/
I found one way to install GlusterFS on one node; below is an all-in-one installation with GlusterFS.
I changed the inventory file like below:
[OSEv3:children]
masters
nodes
etcd
glusterfs
[OSEv3:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
ansible_ssh_user=root
openshift_deployment_type=origin
openshift_enable_origin_repo=false
openshift_disable_check=disk_availability,memory_availability
os_firewall_use_firewalld=true
openshift_public_hostname=console.1.1.0.1.nip.io
openshift_master_default_subdomain=apps.1.1.0.1.nip.io
openshift_storage_glusterfs_is_native=false
openshift_storage_glusterfs_storageclass=true
openshift_storage_glusterfs_heketi_is_native=true
openshift_storage_glusterfs_heketi_executor=ssh
openshift_storage_glusterfs_heketi_ssh_port=22
openshift_storage_glusterfs_heketi_ssh_user=root
openshift_storage_glusterfs_heketi_ssh_sudo=false
openshift_storage_glusterfs_heketi_ssh_keyfile="/root/.ssh/id_rsa"
[masters]
1.1.0.1 openshift_ip=1.1.0.1 openshift_schedulable=true
[etcd]
1.1.0.1 openshift_ip=1.1.0.1
[nodes]
1.1.0.1 openshift_ip=1.1.0.1 openshift_node_group_name="node-config-all-in-one" openshift_schedulable=true
[glusterfs]
1.1.0.1 glusterfs_devices='[ "/dev/vdb" ]'
Now we have to hack the Ansible script, as it expects three nodes, by adding --durability none in the following task file:
openshift-ansible/roles/openshift_storage_glusterfs/tasks/heketi_init_db.yml
The following is the updated snippet:
- name: Create heketi DB volume
  command: "{{ glusterfs_heketi_client }} setup-openshift-heketi-storage --image {{ glusterfs_heketi_image }} --listfile /tmp/heketi-storage.json --durability none"
  register: setup_storage
Since by default it creates a StorageClass that expects a replicated environment, we have to create a custom StorageClass like below, with volumetype: none:
oc create -f - <<EOT
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-nr-storage
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
parameters:
  resturl: http://heketi-storage.glusterfs.svc:8080
  restuser: admin
  secretName: heketi-storage-admin-secret
  secretNamespace: glusterfs
  volumetype: none
provisioner: kubernetes.io/glusterfs
volumeBindingMode: Immediate
EOT
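To confirm dynamic provisioning works against the new StorageClass, you can create a small test PVC (the claim name here is just an example):
oc create -f - <<EOT
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  storageClassName: glusterfs-nr-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOT
oc get pvc test-claim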
Now you can create storage dynamically from the web console :) Any suggestions for improvement are welcome.
Next I will check how I can extend it.

Changing the default behavior of Kubernetes

I have set up a K8S cluster (1 master and 2 slaves) using kubeadm on my laptop.
I deployed 6 replicas of a pod; 3 of them went to each of the slaves.
Then I shut down one of the slaves.
It took ~6 minutes for those 3 pods to be rescheduled on the running node.
Initially, I thought it had something to do with the K8S setup. After some digging, I found out it's because of the K8S defaults for the Controller Manager and Kubelet, as mentioned here. That made sense. I checked the K8S documentation on where to change these configuration properties and also checked the configuration files on the cluster nodes, but couldn't figure it out.
kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)
Could someone point out what needs to be done to make the above-mentioned configuration changes permanent, and the different options for doing so?
For the kubelet, change this file on all your nodes:
/var/lib/kubelet/kubeadm-flags.env
Add the option at the end, or anywhere on this line:
KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --cni-bin-dir=/opt/cni/bin --cni-conf-dir=/etc/cni/net.d --network-plugin=cni --resolv-conf=/run/systemd/resolve/resolv.conf --node-status-update-frequency=4s <== add this flag (4s is the question's target; the default is 10s)
For the kube-controller-manager, change the following file on the master:
/etc/kubernetes/manifests/kube-controller-manager.yaml
In this section:
containers:
- command:
  - kube-controller-manager
  - --address=127.0.0.1
  - --allocate-node-cidrs=true
  - --cloud-provider=aws
  - --cluster-cidr=192.168.0.0/16
  - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
  - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
  - --controllers=*,bootstrapsigner,tokencleaner
  - --kubeconfig=/etc/kubernetes/controller-manager.conf
  - --leader-elect=true
  - --node-cidr-mask-size=24
  - --root-ca-file=/etc/kubernetes/pki/ca.crt
  - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
  - --use-service-account-credentials=true
  - --node-monitor-period=2s <== add this line (2s is the question's target; the default is 5s)
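The question's other two controller-manager settings go into the same command list in the same way; a short sketch under the same assumptions (both are standard kube-controller-manager flags):
  - --node-monitor-grace-period=16s
  - --pod-eviction-timeout=30s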
On your master, run sudo systemctl restart docker so the kube-controller-manager static pod is recreated with the new flags.
On all your nodes, run sudo systemctl restart kubelet.
The new configs should then take effect.
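If you want to double-check that the flags were picked up, a quick sanity check (not from the original answer; the component label is the kubeadm default):
# on a node: confirm the kubelet's effective flag
ps -ef | grep kubelet | grep -- --node-status-update-frequency
# on the master: confirm the controller-manager flags
kubectl -n kube-system get pods -l component=kube-controller-manager -o yaml | grep node-monitor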
Hope it helps.

Openshift-origin not create ha-router

I tried to create an OpenShift haproxy router, in OpenShift Origin, with the CLI:
oadm router router-ha --service-account=router --type="haproxy-router" --subdomain="${name}-${namespace}.op37.dev.procempa.com.br" --replicas=2 --selector='region=infra' --selector='zone=default'
But the router is not created. I have 2 nodes with region=infra; the error is:
2 nodes are available: 1 CheckServiceAffinity, 1 MatchNodeSelector, 2 PodFitsHostPorts.
My openshift-origin is:
Version
OpenShift Master:
v3.7.0+7ed6862
Kubernetes Master:
v1.7.6+a08f5eeb62
OK, let's take a look. We'll need to understand what the error is with the deployment of the router, which I'm guessing is most likely related to the nodes and their labels. Can you please run each of the commands below and add the output to your question?
# Get the nodes, showing what labels they have.
oc get nodes -o wide
# Get the recent deployments.
oc get deploy
# For good measure, let's check some status and recent events.
oc status
oc get events
That should provide a lot of diagnostic data to help!

How to make oc cluster up persistent?

I'm using "oc cluster up" to start my Openshift Origin environment. I can see, however, that once I shutdown the cluster my projects aren't persisted at restart. Is there a way to make them persistent ?
Thanks
There are a couple of ways to do this; persisting resources isn't the primary use case of oc cluster up:
Leverage capturing etcd as described in the oc cluster up README
There is a wrapper tool that makes it easy to do this.
There is now an example in the cluster up --help output; it is bound to stay up to date, so check that first:
oc cluster up --help
...
Examples:
# Start OpenShift on a new docker machine named 'openshift'
oc cluster up --create-machine
# Start OpenShift using a specific public host name
oc cluster up --public-hostname=my.address.example.com
# Start OpenShift and preserve data and config between restarts
oc cluster up --host-data-dir=/mydata --use-existing-config
So, specifically, in v1.3.2 use --host-data-dir and --use-existing-config.
Assuming you are using docker-machine with a VM such as VirtualBox, the easiest way I found is taking a VM snapshot WHILE the VM and OpenShift cluster are up and running. The snapshot backs up memory in addition to disk, so you can later restore the entire cluster by restoring the VM snapshot and then running docker-machine start ...
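A sketch of that flow, assuming the docker-machine VM is named openshift (adjust the VM and snapshot names to your setup):
# take a live snapshot while the cluster is running
VBoxManage snapshot openshift take cluster-running --live
# later: stop the VM, restore the snapshot, and resume via docker-machine
docker-machine stop openshift
VBoxManage snapshot openshift restore cluster-running
docker-machine start openshift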
By the way, as of the latest origin image openshift/origin:v3.6.0-rc.0 and oc CLI, --host-data-dir=/mydata as suggested in the other answer doesn't work for me.
I'm using:
VirtualBox 5.1.26
Kubernetes v1.5.2+43a9be4
openshift v1.5.0+031cbe4
Using --host-data-dir (and the other host-dir flags) didn't work for me:
oc cluster up --logging=true --metrics=true --docker-machine=openshift --use-existing-config=true --host-data-dir=/vm/data --host-config-dir=/vm/config --host-pv-dir=/vm/pv --host-volumes-dir=/vm/volumes
With output:
-- Checking OpenShift client ... OK
-- Checking Docker client ...
Starting Docker machine 'openshift'
Started Docker machine 'openshift'
-- Checking Docker version ...
WARNING: Cannot verify Docker version
-- Checking for existing OpenShift container ... OK
-- Checking for openshift/origin:v1.5.0 image ... OK
-- Checking Docker daemon configuration ... OK
-- Checking for available ports ... OK
-- Checking type of volume mount ...
Using Docker shared volumes for OpenShift volumes
-- Creating host directories ... OK
-- Finding server IP ...
Using docker-machine IP 192.168.99.100 as the host IP
Using 192.168.99.100 as the server IP
-- Starting OpenShift container ...
Starting OpenShift using container 'origin'
FAIL
Error: could not start OpenShift container "origin"
Details:
Last 10 lines of "origin" container log:
github.com/openshift/origin/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc4202a1600, 0x42b94c0, 0x1f, 0xc4214d9f08, 0x2, 0x2)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x16a
github.com/openshift/origin/vendor/github.com/coreos/etcd/mvcc/backend.newBackend(0xc4209f84c0, 0x33, 0x5f5e100, 0x2710, 0xc4214d9fa8)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:106 +0x341
github.com/openshift/origin/vendor/github.com/coreos/etcd/mvcc/backend.NewDefaultBackend(0xc4209f84c0, 0x33, 0x461e51, 0xc421471200)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:100 +0x4d
github.com/openshift/origin/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4204bf640, 0xc4209f84c0, 0x33, 0xc421079a40)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/coreos/etcd/etcdserver/server.go:272 +0x39
created by github.com/openshift/origin/vendor/github.com/coreos/etcd/etcdserver.NewServer
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/coreos/etcd/etcdserver/server.go:274 +0x345
OpenShift writes to the /vm/... directories (which are also defined in VirtualBox) but still won't start.
See https://github.com/openshift/origin/issues/12602
It worked for me too, using VirtualBox snapshots and restoring them.
To make it persistent across shutdowns, you need to provide the --base-dir parameter:
$ mkdir ~/openshift-config
$ oc cluster up --base-dir=~/openshift-config
From the help output:
$ oc cluster up --help
...
Options:
--base-dir='': Directory on Docker host for cluster up configuration
--enable=[*]: A list of components to enable. '*' enables all on-by-default components, 'foo' enables the component named 'foo', '-foo' disables the component named 'foo'.
--forward-ports=false: Use Docker port-forwarding to communicate with origin container. Requires 'socat' locally.
--http-proxy='': HTTP proxy to use for master and builds
--https-proxy='': HTTPS proxy to use for master and builds
--image='openshift/origin-${component}:${version}': Specify the images to use for OpenShift
--no-proxy=[]: List of hosts or subnets for which a proxy should not be used
--public-hostname='': Public hostname for OpenShift cluster
--routing-suffix='': Default suffix for server routes
--server-loglevel=0: Log level for OpenShift server
--skip-registry-check=false: Skip Docker daemon registry check
--write-config=false: Write the configuration files into host config dir
But you shouldn't rely on it, because "cluster up" was removed in version 4.0.0. More here: https://github.com/openshift/origin/pull/21399