OpenShift Pod gets stuck in pending state - mysql

MySQL pod in OpenShift gets stuck after a new deployment and shows the message "The pod has been stuck in the pending state for more than five minutes." What can I do to solve this? I tried to scale the current deployment down to 0 pods and scale the previous deployment, which was working earlier, back up to 1 pod, but it also got stuck.

If a pod is stuck in the pending state, we can remove it by executing:
oc delete pod/<name of pod> --grace-period=0
This command removes the pod immediately, but use it with caution because it may leave process PID files behind on persistent volumes.
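As a rough sketch (the pod name is a placeholder), you can first check why the pod is pending and then force-delete it:
# Look at the Events section at the bottom of the output for the scheduling or mounting failure
oc describe pod <name of pod>
# Force an immediate delete; on newer oc/kubectl versions --force is also required alongside --grace-period=0
oc delete pod/<name of pod> --grace-period=0 --force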

Related

Running Fiware-Cygnus listener on CentOS

I have a VM with CentOS installed, where Orion Context Broker, Cygnus, Mosquitto and MongoDB are present. When I check the connections with the following command:
netstat -ntlpd
I receive the following data (screenshot: Connections).
It shows that something is already listening on ports 8081 and 5050 (which belong to Cygnus). But Cygnus itself is not active; when I run the following:
service cygnus status
it reports that there is no instance of Cygnus running.
When I try to run the Cygnus test, it gives me a fatal error stating that the ports are taken and that the configuration is wrong.
Trying to run Cygnus with
sudo service cygnus start
also fails. Here is the systemctl status:
(screenshot: FailedCygnus)
After checking which process is running under the PIDs assigned to the Cygnus ports, I see this:
(screenshot: CygnusPorts)
Does anyone have a clue what this could be? It feels like Cygnus is there, but something is configured wrong. Also, is there another way of running Cygnus? I need to receive notifications from subscriptions somehow.
Thank you in advance, Stack Overflow!
EDIT 1
I tried killing the processes under the PIDs that are listening on ports 5050 and 8081, but it did not help; Cygnus still cannot be started.
Currently thinking of simply reinstalling everything.
EDIT 2
So, I have managed to run the simple "dummy" listener using the agent_test file. But I guess that is only good at the beginning and for learning purposes, and later using your own configuration is preferred?
For further investigation, using the agent-test.conf file is enough for me: the listener works and data is stored in the database. Perhaps I will encounter this problem again in the future, but for now it works.
What I had to do beforehand was kill the existing processes.
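For reference, a minimal sketch of how one might find and kill whatever is holding the Cygnus ports (port numbers taken from the question; the PID is a placeholder):
# Show which PIDs are listening on the Cygnus ports
sudo netstat -ntlp | grep -E ':(5050|8081)'
# Or, equivalently, with lsof
sudo lsof -i :5050 -i :8081
# Kill the offending processes, then try starting Cygnus again
sudo kill <PID>
sudo service cygnus start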

RabbitMQ cluster (classic config) split

I am trying to configure a RabbitMQ cluster in the cloud using a config file.
My procedure is this:
Get the list of instances I want to cluster with (via the cloud API, before cluster startup).
Modify the config file like this:
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@host1.my.long.domain.name
cluster_formation.classic_config.nodes.2 = rabbit@host2.my.long.domain.name
...
Run rabbitmq-server
I expect all nodes to form a single cluster, but instead there might be 2+ independent clusters. How do I solve this issue?
UPDATE:
I found out that when I run rabbitmqctl join_cluster rabbit@host.in.existing.cluster on a node that is already in some cluster, the node leaves its previous cluster (I expected the clusters to merge). That might be the root of the problem.
UPDATE 2:
I have 4 instances. 3 run bare rabbitmq-servers, and 1 is configured to join the other 3. When started, it joins the last instance in its config; the 2 others show no activity in their logs. This happens with both the classic config and the Erlang config.
When you initially start up your cluster, there is no means to resolve the race condition. Using peer discovery backends will not help with this issue (tested on etcd).
What actually resolved this issue was not deploying the instances simultaneously. When they are started one by one, everything is fine and you get one stable cluster, which can then handle scaling without failure.
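A minimal sketch of that one-by-one startup, assuming three hosts reachable over SSH (host names are placeholders; rabbitmqctl await_startup needs a reasonably recent RabbitMQ, otherwise rabbitmqctl wait <pidfile> can be used instead):
for host in host1.my.long.domain.name host2.my.long.domain.name host3.my.long.domain.name; do
  # Start the node and block until it has fully booted before moving on,
  # so classic-config peer discovery always finds an already-formed cluster to join
  ssh "$host" 'sudo systemctl start rabbitmq-server && sudo rabbitmqctl await_startup'
  ssh "$host" 'sudo rabbitmqctl cluster_status'
done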

OpenShift MySQL failed to attach volume

I have a small Python Flask server running on OpenShift Starter us-west-1. I use a MySQL container for data storage. Yesterday I scaled the MySQL application down from 1 to 0 pods. When I tried to scale it back up to 1 pod, container creation kept failing when trying to mount the persistent volume:
Failed to attach volume "pvc-8bcc2d2b-8d92-11e7-8d9c-06d5ca59684e" on node "ip-XXX-XXX-XXX-XXX.us-west-1.compute.internal" with: Error attaching EBS volume "vol-08b957e6975554914" to instance "i-05e81383e32bcc8ac": VolumeInUse: vol-08b957e6975554914 is already attached to an instance status code: 400, request id: 3ec76894-f611-445f-8416-2db2b1e9c5b7
I have seen some suggestions that say that the deployment strategy needs to be "Recreate", but it is already set like that. I have tried scaling down and up again multiple times. I have also tried to manually stop the pod deployment and start a new one, but it keeps giving the same errors.
Any suggestions?
I contacted support; they fixed the volume and recently also updated the platform to prevent this bug in the future.
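For anyone hitting the same error, a quick way to confirm the deployment strategy before contacting support (the deployment config name mysql is assumed here):
# Should print Recreate; with Rolling, the old pod keeps the EBS volume attached while the new pod tries to mount it
oc get dc mysql -o jsonpath='{.spec.strategy.type}'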

OpenShift Online Next Gen "Unable to mount volumes for pod"

I'm trying to use a regular EBS persistent storage volume in OpenShift Online Next Gen, and getting the following error when attempting to deploy:
Unable to mount volumes for pod "production-5-vpxpw_instanttabletop(d784f054-a66b-11e7-a41e-0ab8769191d3)": timeout expired waiting for volumes to attach/mount for pod "instanttabletop"/"production-5-vpxpw". list of unattached/unmounted volumes=[volume-mondv]
Followed (after a while) by multiple instances of:
Failed to attach volume "pvc-702876a2-a663-11e7-8348-0a69cdf75e6f" on node "ip-172-31-61-152.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0fb5515c87914b844" to instance "i-08d3313801027fbc3": VolumeInUse: vol-0fb5515c87914b844 is already attached to an instance status code: 400, request id: 54dd24cc-6ab0-434d-85c3-f0f063e73099
The log for the deploy pod looks like this after it all times out:
--> Scaling production-5 to 1
--> Waiting up to 10m0s for pods in rc production-5 to become ready
W1001 05:53:28.496345 1 reflector.go:323] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:509: watch of *api.Pod ended with: too old resource version: 1455045195 (1455062250)
error: update acceptor rejected production-5: pods for rc "production-5" took longer than 600 seconds to become ready
I thought at first that this might be related to this issue, but the only running pods are the deploy and the one that's trying to start, and I've switched to a Recreate strategy as suggested there, with no results.
Things did deploy and run normally the very first time, but since then I haven't been able to get it to deploy successfully.
Can anyone shed a little light on what I'm doing wrong here?
Update #1:
As an extra wrinkle, sometimes when I deploy it's taking what seems to be a long time to spin up deploy pods for this (I don't actually know how long it should take, but I get a warning suggesting things are going slowly, and my current deploy is sitting at 15+ minutes so far without having stood up).
In the deploy pod's event list, I'm seeing multiple instances each of Error syncing pod and Pod sandbox changed, it will be killed and re-created. as I wait, having touched nothing.
Doesn't happen every time, and I haven't discerned a pattern.
Not sure if this is even related, but seemed worth mentioning.
Update #2:
I tried deploying again this morning, and after canceling one deploy which was experiencing the issue described in my first update above, things stood up successfully.
I made no changes as far as I'm aware, so I'm baffled as to what the issue is or was here. I'll make a further update as to whether or not the issue recurs.
Update #3
After a bunch of further experimentation, I seem to be able to get my pod up and running regularly now. I didn't change anything about the configuration, so I assume this is something to do with sequencing, but even now it's not without some irregularities:
If I start a deploy, the existing running pod hangs indefinitely in the terminating state according to the console, and will stay that way until it's hard deleted (without waiting for it to close gracefully). Until that happens, it'll continue to produce the error described above (as you'd expect).
Frankly, this doesn't make sense to me, compared to the issues I was having last night - I had no other pods running when I was getting these errors before - but at least it's progress in some form.
I'm having some other issues once my server is actually up and running (requests not making it to the server, and issues trying to upgrade to a websocket connection), but those are almost certainly separate, so I'll save them for another question unless someone tells me they're actually related.
Update #4
OpenShift's ongoing issue listing hasn't changed, but things seem to be loading correctly now, so marking this as solved and moving on to other things.
For posterity, changing from Rolling to Recreate is key here, and even then you may need to manually kill the old pod if it gets stuck trying to shut down gracefully.
You cannot use a persistent volume in OpenShift Online with an application which has the deployment strategy set as 'Rolling'. Edit the deployment configuration and make sure the deployment strategy is set to 'Recreate'.
You state you used 'replace'. If you set it to that by editing the JSON/YAML of the deployment configuration, the value change would have been discarded as 'replace' isn't a valid option.
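As a sketch (the deployment config name production is inferred from the pod names above), the strategy can also be switched from the command line instead of editing the raw JSON/YAML:
# Switch to Recreate so the old pod releases the EBS volume before the new one tries to mount it
oc patch dc/production -p '{"spec":{"strategy":{"type":"Recreate"}}}'
# Verify the change
oc get dc/production -o jsonpath='{.spec.strategy.type}'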
The error is clearly indicating that the volume is already attached to some other running instance.
VolumeInUse: vol-0fb5515c87914b844 is already attached to an instance status code: 400, request id: 54dd24cc-6ab0-434d-85c3-f0f063e73099
You must do a cleanup by either:
1) Detaching the volume from the running instance and reattaching it. Be careful about the data, because the EBS volume's lifecycle is limited to the pod lifecycle.
2) Before creating another deployment for a new build, making sure that the earlier running container instance is killed (by deleting the container instance); see the sketch below.
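A minimal sketch of that cleanup for a pod stuck in Terminating (the old pod name is a placeholder):
# Force-delete the old pod so it releases the EBS volume
oc delete pod <old-pod-name> --grace-period=0 --force
# Then retry the deployment
oc rollout latest dc/production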

Cluster not responding, weird error message

My Container Engine cluster has a red exclamation mark next to its name in the Google Cloud Console overview of Container Engine. A tooltip says "The cluster has a problem. Click the cluster name for details." Once I click the name I don't get any more info; it's just the usual summary.
Stackdriver doesn't report anything unusual. No incidents are logged, all pods are marked as healthy but I can't reach my services.
Trying to get info or logs via kubectl doesn't work:
kubectl cluster-info
Unable to connect to the server: dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
How can I debug this problem? And what does this cryptic message mean anyway?
Are you able to use other kubectl commands such as kubectl get pods?
This sounds like the cluster isn't set up correctly or there's some network issue. Would you also try kubectl config view to see how your cluster is configured? More specifically, look at the current-context and clusters fields to see if your cluster is configured as expected.
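For example (the cluster name, zone and project are placeholders), you can inspect the client-side configuration and, on GKE, refresh the credentials for the cluster:
# Show which context and cluster endpoint kubectl is currently using
kubectl config current-context
kubectl config view --minify
# On GKE, re-fetch credentials in case the kubeconfig entry is stale
gcloud container clusters get-credentials <cluster-name> --zone <zone> --project <project-id>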
In our case it was a billing issue. Someone had mistakenly disabled the billing profile for our project. We re-enabled it and waited a while; after 20-30 minutes the cluster came back up with no errors.
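If you suspect the same cause, one way to check is the gcloud beta billing command (assuming it is available in your gcloud version; the project ID is a placeholder):
# Shows billingEnabled: true/false for the project
gcloud beta billing projects describe <project-id>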