OpenShift pods stuck in Init:0/1 status

I am deploying microservices to my OpenShift cluster, and about 10 of the 90 are stuck in Init:0/1 status. Is there a way to troubleshoot this?

If you are using the web console:
Go to the Developer tab and open your project.
There you should see the project's recent events, including any errors related to the pods stuck at 0/1. For me it was something like:
Error creating: pods "xxxx" is forbidden: exceeded quota: project-quota, requested: requests.memory=500Mi, used: requests.memory=750Mi, limited: requests.memory=1Gi
That meant my project was trying to use 1250Mi of memory requests (750Mi already in use plus the 500Mi requested) when the limit was 1Gi.
In this case I went to the project quotas on my project screen
and saw something like this in YAML format:
spec:
  hard:
    file-share-dr-off.storageclass.storage.k8s.io/requests.storage: xGi
    file-share-dr-on.storageclass.storage.k8s.io/requests.storage: xGi
    limits.cpu: 'x'
    limits.memory: 1Gi
    pods: 'x'
    requests.cpu: 'x'
    requests.memory: 1Gi
    vsan.storageclass.storage.k8s.io/requests.storage: xGi
So I increased limits.memory and requests.memory to 2Gi in my project quota and hit Save.
After that the pod errors went away,
and the deployment went from 0/1 to 1/1 pods.
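The arithmetic behind an "exceeded quota" message can be checked mechanically. The following is a minimal sketch, not part of any OpenShift tooling: the helper names to_bytes and quota_overage are made up for illustration, and it only understands the binary memory suffixes Ki, Mi, Gi and Ti.

```python
import re

# Binary suffixes used by Kubernetes memory quantities (only the common ones).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

# The event text quoted above.
MESSAGE = (
    'Error creating: pods "xxxx" is forbidden: exceeded quota: project-quota, '
    "requested: requests.memory=500Mi, used: requests.memory=750Mi, "
    "limited: requests.memory=1Gi"
)


def to_bytes(quantity):
    """Convert a memory quantity such as '500Mi' or '1Gi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", quantity)
    if match is None:
        raise ValueError("unsupported quantity: %r" % quantity)
    number, suffix = match.groups()
    return int(number) * UNITS.get(suffix, 1)


def quota_overage(message, resource):
    """Return how many bytes over quota the request would be (<= 0 means it fits)."""
    values = {}
    for field in ("requested", "used", "limited"):
        # Lazily skip other resources listed before the one we want,
        # but never cross a ':' into the next field.
        pattern = r"%s: [^:]*?%s=(\w+)" % (field, re.escape(resource))
        found = re.search(pattern, message)
        if found is None:
            raise ValueError("%s not found for %s" % (field, resource))
        values[field] = to_bytes(found.group(1))
    return values["requested"] + values["used"] - values["limited"]


if __name__ == "__main__":
    over = quota_overage(MESSAGE, "requests.memory")
    print(over // 2**20, "Mi over quota")  # 226 Mi over quota
```

A positive result is the minimum amount by which the quota has to grow (or other workloads have to shrink) before the pod can be created.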

Related

error when creating ".": persistentvolumeclaims "wp-pv-claim" is forbidden: exceeded quota

I'm trying to run WordPress on Kubernetes by following this link, and the only change I made was reducing 20Gi to 5Gi. When I run kubectl apply -k ., I get this error:
Error from server (Forbidden): error when creating ".": persistentvolumeclaims "wp-pv-claim" is forbidden: exceeded quota: storagequota, requested: requests.storage=5Gi, used: requests.storage=5Gi, limited: requests.storage=5Gi
I searched but did not find an answer that matches my case (or maybe I'm missing something).
Could you please answer these questions:
How do I solve the above issue?
If the volume's size is limited to 5G, does that mean the pod cannot use more than 5G? For example, if I exec into the pod and run a command like dd if=/dev/zero of=file bs=1M count=8000, will it create an 8G file or not? Do this quota and the volume size limit the whole pod, or only a specific path like /var/www/html?
Edit 1
describe pvc mysql-pv-claim
Name:          mysql-pv-claim
Namespace:     default
StorageClass:
Status:        Pending
Volume:
Labels:        app=wordpress
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       wordpress-mysql-6c479567b-vzpm5
Events:
  Type    Reason         Age                 From                         Message
  ----    ------         ----                ----                         -------
  Normal  FailedBinding  4m (x222 over 59m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
I decided to summarize our conversation in the comments for better readability and visibility.
At first the issue seemed to be caused by the resource quota:
Error from server (Forbidden): error when creating ".": persistentvolumeclaims "wp-pv-claim" is forbidden: exceeded quota: storagequota, requested: requests.storage=5Gi, used: requests.storage=5Gi, limited: requests.storage=5Gi
It looked like a PVC already existed and was using the full 5Gi quota, so a new one could not be created.
OP removed the resource quota, although that was not necessary in this case, since the real issue was with the PVC.
kubectl describe pvc mysql-pv-claim showed the following event:
Events:
  Type    Reason         Age                 From                         Message
  ----    ------         ----                ----                         -------
  Normal  FailedBinding  4m (x222 over 59m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
Event message:
persistentvolume-controller no persistent volumes available for this claim and no storage class is set
OP created the cluster with kubeadm, and kubeadm doesn't come with a storage provider deployed out of the box, so one has to be added manually. (A storage provider is a controller that can create a volume and mount it.)
Each StorageClass has a provisioner that determines what volume plugin is used for provisioning PVs. This field must be specified. Since there was no storage class in the cluster, OP decided to create one and picked the Local storage class, but forgot that:
Local volumes do not currently support dynamic provisioning [...].
and
Local volumes can only be used as a statically created PersistentVolume. Dynamic provisioning is not supported
This means that a local volume had to be created manually.
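As an illustration of what "created manually" involves, here is a minimal sketch that builds a StorageClass without a provisioner plus one statically defined local PersistentVolume and prints them as JSON, which kubectl also accepts. Every concrete value in it (the class name local-storage, the PV name wp-local-pv, the node worker-1, the path /mnt/disks/wp and the 5Gi size) is a placeholder, not something taken from OP's cluster.

```python
import json


def local_storage_manifests(pv_name, node_name, path, size, class_name="local-storage"):
    """Build a no-provisioner StorageClass and one statically created local PV."""
    storage_class = {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": class_name},
        # Local volumes have no dynamic provisioner.
        "provisioner": "kubernetes.io/no-provisioner",
        # Delay binding until a pod that uses the claim is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
    }
    persistent_volume = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": pv_name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": class_name,
            "local": {"path": path},
            # A local PV must be pinned to the node that holds the directory.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": [node_name],
                        }]
                    }]
                }
            },
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [storage_class, persistent_volume]}


if __name__ == "__main__":
    # Placeholder values: adjust the node, path and size to your cluster.
    manifests = local_storage_manifests("wp-local-pv", "worker-1", "/mnt/disks/wp", "5Gi")
    print(json.dumps(manifests, indent=2))
```

The output can be piped to kubectl apply -f -. The directory has to exist on the named node, and the PVC would also need storageClassName set to the same class (or the class marked as default), since the event above complains that no storage class is set.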

Deployment "tiller" exceeded its progress deadline

I'm trying to install the Tiller server in an OpenShift project.
Helm/tiller version: 2.9.0
My project name: paytiller
At step 3, I execute this command (as described in this document: https://www.openshift.com/blog/getting-started-helm-openshift):
oc rollout status deployment tiller
I get this error:
error: deployment "tiller" exceeded its progress deadline
I can't tell what the error message means, and I couldn't find any logs.
Any idea what causes this error?
If this doesn't work, what are the other options for templating in OpenShift?
EDIT
oc get events
Events:
  Type     Reason   Age                   From                  Message
  ----     ------   ----                  ----                  -------
  Warning  Failed   14m (x5493 over 21h)  kubelet, example.com  Error: ImagePullBackOff
  Normal   Pulling  9m (x255 over 21h)    kubelet, example.com  pulling image "gcr.io/kubernetes-helm/tiller:v2.9.0"
  Normal   BackOff  4m (x5537 over 21h)   kubelet, example.com  Back-off pulling image "gcr.io/kubernetes-helm/tiller:v2.9.0"
Thanks.
The issue was with the permissions on our OpenShift platform: we didn't have access to download images directly from public open-source registries.
We added the kubernetes-helm Tiller image to our organization's repository, and then we were able to pull the image into the OpenShift project. It is working now. Still, the logs gave us no clue about the issue.
The status ImagePullBackOff tells you that the image gcr.io/kubernetes-helm/tiller:v2.9.0 could not be pulled from the container registry, so your OpenShift node cannot pull that image for some reason. This is often due to network proxies, a nonexistent image (not the issue here), or other restrictions in the (corporate) network.
You can use oc describe pod <pod that shows ImagePullBackOff> to see a more detailed error message that may help you further.
Also, note that the blog post you linked is from 2017, which is very old. Here is a more current version: Build Kubernetes Operators from Helm Charts in 5 steps.

openshift v3 online pro volume and memory limit issues

I am trying to run sonatype/nexus3 on OpenShift Online v3 Pro. If I just use the web console to create a new app from the image, it is assigned only 512Mi and dies with OOM. It did get created, though, and logged a lot of Java output before it ran out of memory. The web console doesn't appear to offer a way to set the memory for the image, and when I try to edit the pod's YAML, it doesn't let me change the memory limit.
Reading the docs about memory limits, it seems I can run it like this:
oc run nexus333 --image=sonatype/nexus3 --limits=memory=750Mi
Then it doesn't even start. It dies with:
{kubelet ip-172-31-59-148.ec2.internal} Error: Error response from daemon: {"message":"create c30deb38b3c26252bf1218cc898fbf1c68d8fc14e840076710c211d58ed87a59: mkdir /var/lib/docker/volumes/c30deb38b3c26252bf1218cc898fbf1c68d8fc14e840076710c211d58ed87a59: permission denied"}
More information from oc get events:
FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
16m 16m 1 nexus333-1-deploy Pod Normal Scheduled {default-scheduler } Successfully assigned nexus333-1-deploy to ip-172-31-50-97.ec2.internal
16m 16m 1 nexus333-1-deploy Pod spec.containers{deployment} Normal Pulling {kubelet ip-172-31-50-97.ec2.internal} pulling image "registry.reg-aws.openshift.com:443/openshift3/ose-deployer:v3.6.173.0.21"
16m 16m 1 nexus333-1-deploy Pod spec.containers{deployment} Normal Pulled {kubelet ip-172-31-50-97.ec2.internal} Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/ose-deployer:v3.6.173.0.21"
15m 15m 1 nexus333-1-deploy Pod spec.containers{deployment} Normal Created {kubelet ip-172-31-50-97.ec2.internal} Created container
15m 15m 1 nexus333-1-deploy Pod spec.containers{deployment} Normal Started {kubelet ip-172-31-50-97.ec2.internal} Started container
15m 15m 1 nexus333-1-rftvd Pod Normal Scheduled {default-scheduler } Successfully assigned nexus333-1-rftvd to ip-172-31-59-148.ec2.internal
15m 14m 7 nexus333-1-rftvd Pod spec.containers{nexus333} Normal Pulling {kubelet ip-172-31-59-148.ec2.internal} pulling image "sonatype/nexus3"
15m 10m 19 nexus333-1-rftvd Pod spec.containers{nexus333} Normal Pulled {kubelet ip-172-31-59-148.ec2.internal} Successfully pulled image "sonatype/nexus3"
15m 15m 1 nexus333-1-rftvd Pod spec.containers{nexus333} Warning Failed {kubelet ip-172-31-59-148.ec2.internal} Error: Error response from daemon: {"message":"create 3aa35201bdf81d09ef4b09bba1fc843b97d0339acfef0c30cecaa1fbb6207321: mkdir /var/lib/docker/volumes/3aa35201bdf81d09ef4b09bba1fc843b97d0339acfef0c30cecaa1fbb6207321: permission denied"}
I am not sure why if I use the web console I cannot assign more memory. I am not sure why running it with oc run dies with the mkdir error. Can anyone tell me how to run sonatype/nexus3 on openshift online pro?
Looking in the documentation I see that it is a Java VM solution.
When using Java 8, memory usage can be DRAMATICALLY IMPROVED using only the following 2 runtime Java VM options:
... "-XX:+UnlockExperimentalVMOptions", "-XX:+UseCGroupMemoryLimitForHeap" ...
I just deployed my container (Spring Boot JAR) that consumed over 650 MB RAM. With just these two (new) options RAM consumption dropped to just 270 MB!!!
So, with these 2 runtime settings all OOM's are left far behind! Enjoy!
You may want to also follow along with the tutorial that is in the OpenShift docs https://docs.openshift.com/online/dev_guide/app_tutorials/maven_tutorial.html
I have had success deploying this in OpenShift Online Pro
Okay, the mkdir /var/lib/docker/volumes/ permission denied error seems to occur because the image needs a /nexus-data mount and that is refused. I found this by deploying from the web console (which dies with OOM) and then viewing the YAML of the created pod to see the generated volume mount.
Creating the pod from the following YAML with cat nexus3_pod.ephemeral.yaml | oc create -f -, which includes the volume mount and explicit memory settings, lets the container start:
apiVersion: "v1"
kind: "Pod"
metadata:
  name: "nexus3"
  labels:
    name: "nexus3"
spec:
  containers:
    -
      name: "nexus3"
      resources:
        requests:
          memory: "1200Mi"
        limits:
          memory: "1200Mi"
      image: "sonatype/nexus3"
      ports:
        -
          containerPort: 8081
          name: "nexus3"
      volumeMounts:
        - mountPath: /nexus-data
          name: nexus3-1
  volumes:
    - emptyDir: {}
      name: nexus3-1
Notes
The image sets -Xmx1200m, as documented at sonatype/docker-nexus3, so if you assign less than 1200Mi of memory it will crash with OOM when the heap grows past the limit. You may as well set both the request and the limit to at least the max heap size.
When the allocated memory was too low, it crashed just as it was setting up the DB. That corrupted the DB log, so when I recreated it with more memory it got into a crash loop with "couldn't load 4 byte from 0 byte file". It seems that with an emptyDir the files hang around between crash restarts and memory changes (that's documented behaviour, I think). I had to create a pod with a different name to get a clean emptyDir, with 1200Mi of memory assigned, to get it all to start.
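The relationship between the JVM's -Xmx setting and the container memory limit can be sanity-checked before deploying. This is only a rough sketch with made-up helper names (xmx_bytes, limit_bytes, heap_fits); it compares the maximum heap against the limit and nothing else, so a True result is a lower bound, because the JVM also needs memory outside the heap.

```python
import re

JVM_UNITS = {"k": 2**10, "m": 2**20, "g": 2**30}      # -Xmx suffixes
K8S_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}   # Kubernetes binary suffixes


def xmx_bytes(flag):
    """Parse a JVM flag such as '-Xmx1200m' into bytes."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG])?", flag)
    if match is None:
        raise ValueError("not an -Xmx flag: %r" % flag)
    number, suffix = match.groups()
    return int(number) * JVM_UNITS.get((suffix or "").lower(), 1)


def limit_bytes(quantity):
    """Parse a container memory limit such as '1200Mi' into bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)?", quantity)
    if match is None:
        raise ValueError("unsupported quantity: %r" % quantity)
    number, suffix = match.groups()
    return int(number) * K8S_UNITS.get(suffix, 1)


def heap_fits(flag, limit):
    """True if the maximum heap alone does not exceed the container limit."""
    return xmx_bytes(flag) <= limit_bytes(limit)


if __name__ == "__main__":
    for limit in ("512Mi", "750Mi", "1200Mi"):
        print(limit, heap_fits("-Xmx1200m", limit))
```

With the values from this thread, the 512Mi default and the 750Mi attempt both fail the check, while 1200Mi only just passes it.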

Openshift online deployment error when adding volume to pod

I have a pod and I am attempting to attach persistent MySQL storage to it. The deployment starts and, after a while, fails with the following error in the log:
--> Scaling up php-4 from 0 to 1, scaling down php-1 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
Scaling php-4 up to 1
--> FailedCreate: php-4 Error creating: pods "php-4-" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=1,limits.memory=512Mi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
error: timed out waiting for "php-4" to be synced
If this is caused by limits, how can I deploy a new version of a pod with a new config if I can only run one at a time? Is there something I am missing?
If you are at your resource limits, a rolling deployment will not work, because creating the new pod would exceed the quota. You need to change the deployment strategy in the deployment config from Rolling to Recreate if you want to run at your resource limits.
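The difference between the two strategies comes down to when the old pod's share of the quota is released. The sketch below models that with the numbers from the log (CPU in millicores, memory in Mi). The function names are invented for illustration, and it assumes the old pod has the same limits as the new one, which the log does not state.

```python
def fits(used, requested, limit):
    """True if every requested resource still fits under the quota."""
    return all(used[name] + requested[name] <= limit[name] for name in requested)


def can_roll_out(strategy, used, old_pod, new_pod, limit):
    """Check whether the new pod can be created under the given strategy."""
    if strategy == "Rolling":
        # The new pod is created while the old one is still running.
        return fits(used, new_pod, limit)
    if strategy == "Recreate":
        # The old pod is removed first, which frees its share of the quota.
        freed = {name: used[name] - old_pod[name] for name in used}
        return fits(freed, new_pod, limit)
    raise ValueError("unknown strategy: %r" % strategy)


# Values from the FailedCreate message: used equals the hard limit.
LIMIT = {"cpu": 2000, "memory": 1024}
USED = {"cpu": 2000, "memory": 1024}
NEW_POD = {"cpu": 1000, "memory": 512}
OLD_POD = dict(NEW_POD)  # assumption: the old pod has the same limits

if __name__ == "__main__":
    for strategy in ("Rolling", "Recreate"):
        print(strategy, can_roll_out(strategy, USED, OLD_POD, NEW_POD, LIMIT))
```

Under these assumptions Rolling fails and Recreate succeeds, at the cost of a short period with no pod running.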

Container-VM Image with GPD Volumes fails with "Failed to get GCE Cloud Provider. plugin.host.GetCloudProvider returned <nil> instead"

I am currently trying to switch from the "Container-Optimized Google Compute Engine Images" (https://cloud.google.com/compute/docs/containers/container_vms) to the "Container-VM" image (https://cloud.google.com/compute/docs/containers/vm-image/#overview). In my containers.yaml, I define a volume and a container that uses it:
apiVersion: v1
kind: Pod
metadata:
  name: workhorse
spec:
  containers:
    - name: postgres
      image: postgres:9.5
      imagePullPolicy: Always
      volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: postgres-storage
      gcePersistentDisk:
        pdName: disk-name
        fsType: ext4
This setup worked fine with the "Container-Optimized Google Compute Engine Images", but it fails with the "Container-VM" image. In the logs, I can see the following error:
May 24 18:33:43 battleship kubelet[629]: E0524 18:33:43.405470 629 gce_util.go:176]
Error getting GCECloudProvider while detaching PD "disk-name":
Failed to get GCE Cloud Provider. plugin.host.GetCloudProvider returned <nil> instead
Thanks in advance for any hint!
This happens only when the kubelet is run without the --cloud-provider=gce flag. Unless something else is going on, the problem depends on how GCP launches Container-VMs.
Please contact Google Cloud Platform support.
Note, if this happens to you when using GCE: add the --cloud-provider=gce flag to the kubelet on all your workers. This only applies to 1.2 clusters because, if I'm not wrong, there is an ongoing attach/detach redesign targeted for 1.3 clusters that will move this logic out of the kubelet.
In case someone is interested in the attach/detach redesign, here is the corresponding GitHub issue: https://github.com/kubernetes/kubernetes/issues/20262