OpenShift Online Next Gen "Unable to mount volumes for pod"

I'm trying to use a regular EBS persistent storage volume in OpenShift Online Next Gen, and getting the following error when attempting to deploy:
Unable to mount volumes for pod "production-5-vpxpw_instanttabletop(d784f054-a66b-11e7-a41e-0ab8769191d3)": timeout expired waiting for volumes to attach/mount for pod "instanttabletop"/"production-5-vpxpw". list of unattached/unmounted volumes=[volume-mondv]
Followed (after a while) by multiple instances of:
Failed to attach volume "pvc-702876a2-a663-11e7-8348-0a69cdf75e6f" on node "ip-172-31-61-152.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0fb5515c87914b844" to instance "i-08d3313801027fbc3": VolumeInUse: vol-0fb5515c87914b844 is already attached to an instance status code: 400, request id: 54dd24cc-6ab0-434d-85c3-f0f063e73099
The log for the deploy pod looks like this after it all times out:
--> Scaling production-5 to 1
--> Waiting up to 10m0s for pods in rc production-5 to become ready
W1001 05:53:28.496345 1 reflector.go:323] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:509: watch of *api.Pod ended with: too old resource version: 1455045195 (1455062250)
error: update acceptor rejected production-5: pods for rc "production-5" took longer than 600 seconds to become ready
I thought at first that this might be related to this issue, but the only running pods are the deploy and the one that's trying to start, and I've switched to a Recreate strategy as suggested there, with no results.
Things did deploy and run normally the very first time, but since then I haven't been able to get it to deploy successfully.
Can anyone shed a little light on what I'm doing wrong here?
Update #1:
As an extra wrinkle, deploys sometimes take what seems like a long time to spin up deploy pods (I don't actually know how long it should take, but I get a warning suggesting things are going slowly, and my current deploy has been sitting at 15+ minutes so far without standing up).
In the deploy pod's event list, I'm seeing multiple instances each of "Error syncing pod" and "Pod sandbox changed, it will be killed and re-created." as I wait, having touched nothing.
Doesn't happen every time, and I haven't discerned a pattern.
Not sure if this is even related, but seemed worth mentioning.
Update #2:
I tried deploying again this morning, and after canceling one deploy which was experiencing the issue described in my first update above, things stood up successfully.
I made no changes as far as I'm aware, so I'm baffled as to what the issue is or was here. I'll make a further update as to whether or not the issue recurs.
Update #3:
After a bunch of further experimentation, I seem to be able to get my pod up and running regularly now. I didn't change anything about the configuration, so I assume this is something to do with sequencing, but even now it's not without some irregularities:
If I start a deploy, the existing running pod hangs indefinitely in the terminating state according to the console, and will stay that way until it's hard deleted (without waiting for it to close gracefully). Until that happens, it'll continue to produce the error described above (as you'd expect).
Frankly, this doesn't make sense to me, compared to the issues I was having last night - I had no other pods running when I was getting these errors before - but at least it's progress in some form.
I'm having some other issues once my server is actually up and running (requests not making it to the server, and issues trying to upgrade to a websocket connection), but those are almost certainly separate, so I'll save them for another question unless someone tells me they're actually related.
Update #4:
OpenShift's ongoing issue listing hasn't changed, but things seem to be loading correctly now, so marking this as solved and moving on to other things.
For posterity, changing from Rolling to Recreate is key here, and even then you may need to manually kill the old pod if it gets stuck trying to shut down gracefully.
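For reference, the manual kill can be done from the CLI; the pod name here is the stuck one from the errors above, and --grace-period=0 skips graceful shutdown, so only use it on a pod that is already wedged:
oc delete pod production-5-vpxpw --grace-period=0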

You cannot use a persistent volume in OpenShift Online with an application which has the deployment strategy set as 'Rolling'. Edit the deployment configuration and make sure the deployment strategy is set to 'Recreate'.
You state you used 'replace'. If you set it to that by editing the JSON/YAML of the deployment configuration, the value change would have been discarded as 'replace' isn't a valid option.
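For reference, the switch can be made by editing the deployment configuration (the dc name production is inferred from the pod names above, so adjust to yours):
oc edit dc/production
and setting the strategy type:
spec:
  strategy:
    type: Recreate
or in one step with a patch:
oc patch dc/production -p '{"spec":{"strategy":{"type":"Recreate"}}}'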

The error clearly indicates that the volume is already attached to some other running instance.
VolumeInUse: vol-0fb5515c87914b844 is already attached to an instance status code: 400, request id: 54dd24cc-6ab0-434d-85c3-f0f063e73099
You must do a cleanup by either:
1) Detaching the volume from the running instance and reattaching it (a sketch with the AWS CLI follows below). Be careful about the data, because the EBS volume's lifecycle is tied to the pod's lifecycle.
2) Making sure, before creating another deployment for a new build, that the earlier running container instance is killed (by deleting the container instance).
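If you have access to the underlying AWS account, option 1 might look like this with the AWS CLI, using the volume ID from the error message above:
aws ec2 detach-volume --volume-id vol-0fb5515c87914b844
On OpenShift Online you typically don't have that level of AWS access, so option 2 (deleting the stuck pod) is usually the practical route.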

Related

VerneMQ plugin_chain_exhausted Authentication MySQL

I have a running instance of VerneMQ (a cluster of 2 nodes) on Google Kubernetes, using MySQL (CloudSQL) for auth. The server accepts connections over TLS.
It works fine, but after a few days I start seeing this message in the log:
can't authenticate client {[],<<"Client-id">>} from X.X.X.X:16609 due to plugin_chain_exhausted
The client app (Paho) complains that the server refused the connection for being "not authorized" (code=5 in the Paho error).
After a few retries it finally connects, but every time it gets harder and harder until it just won't connect anymore.
If I restart VerneMQ, everything gets back to normal.
I have only 3 clients at most connected at any one time.
Clients that are already connected have no issues with pub/sub.
In my configuration I have (among other things):
log.console.level=debug
plugins.vmq_diversity=on
vmq_diversity.mysql.* = all of them set
allow_anonymous=off
vmq_diversity.auth_mysql.enabled=on
It's like the server degrades over time, yet the status web page reports no problems.
My VerneMQ server was built from the Git repository about a month ago and runs in a Docker container.
What could be the cause?
What else could I check to find possible causes? Maybe a vmq_diversity misconfiguration?
Thanks
To quickly explain the plugin_chain_exhausted log: with Verne you can run multiple authentication/authorization plugins, and they will be checked in a chain. If one plugin allows the client, it will be in. If no plugin allows the client, you'll see the log above.
This does not explain the behaviour you describe, though. I don't think I have seen that.
In any case, the first thing to check is whether you actually run multiple plugins. For instance: have you disabled the vmq_passwd and vmq_acl plugins?
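For reference, disabling those two plugins in vernemq.conf looks something like this (a sketch; the exact file location depends on your install):
plugins.vmq_passwd = off
plugins.vmq_acl = off
With only vmq_diversity left enabled, a plugin_chain_exhausted then means the MySQL lookup itself rejected (or failed to match) the client.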

Communication link failure: 1047 WSREP has not yet prepared node for application use

We had a three-node cluster with MariaDB 10.4. We had an outage and the servers all rebooted with one having an irrecoverable network issue at the time.
We set up another server and added it to the cluster as a third member later.
However, ever since then, we have been getting this error intermittently:
*3287799 FastCGI sent in stderr: "PHP message: An Error occurred while handling another error:
PDOException: SQLSTATE[08S01]: Communication link failure: 1047 WSREP has not yet prepared node for application use in /var/....yii2/db/Command.php:1293
To fix this issue, we shut down all three nodes one by one and then re-initialized the cluster, even with a new cluster name and everything.
The first one was started with "galera_new_cluster" and the remaining two were added to this cluster. However, we still kept getting the same error intermittently.
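For reference, that bootstrap sequence is the standard Galera one (service names may differ by distribution):
galera_new_cluster          # on the first node only
systemctl start mariadb     # on each remaining node, one at a time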
The workaround at mariadb galera - Error when a node shutdown ERROR 1047 WSREP has not yet prepared node for application use was followed but that didn't do anything, as expected.
Next, we set up a fresh single server and installed the new 10.5.x MariaDB server on it, took a backup from the old cluster using mariabackup, and restored it onto this new single server.
This single server was set up as a new cluster with fresh details and everything. We wanted to run it as a single-node cluster to see whether the error still persisted. Oddly enough, the error is still there, and it comes up every half an hour or so.
Has anyone got any clue what could be the reason for this weird issue we're facing? Currently we don't know what exactly the issue is, which is why we're having a hard time solving it.
Any help would be greatly appreciated.
Update:
We turned off Galera on this single-node cluster and ran it as a simple stand-alone MariaDB server. However, we still got the same errors in our web server's logs. This is bonkers.
Any idea? Anyone?
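One sanity check worth running here (these are standard MariaDB/Galera status variables, queried from any client) is whether the node actually has wsrep disabled and reports itself ready:
mysql -e "SHOW VARIABLES LIKE 'wsrep_on'; SHOW STATUS LIKE 'wsrep_ready';"
If wsrep_on is OFF on the stand-alone server, a 1047 error in the application logs would point at the app still connecting to one of the old nodes rather than at this server.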

OpenShift MySQL failed to attach volume

I have a small Python Flask server running on OpenShift starter us-west-1. I use a MySQL container for data storage. Yesterday I scaled down the MySQL application from 1 to 0 pods. When I tried to scale it back up to 1 pod, the container creation keeps failing when trying to mount the persistent volume:
Failed to attach volume "pvc-8bcc2d2b-8d92-11e7-8d9c-06d5ca59684e" on node "ip-XXX-XXX-XXX-XXX.us-west-1.compute.internal" with: Error attaching EBS volume "vol-08b957e6975554914" to instance "i-05e81383e32bcc8ac": VolumeInUse: vol-08b957e6975554914 is already attached to an instance status code: 400, request id: 3ec76894-f611-445f-8416-2db2b1e9c5b7
I have seen some suggestions that say that the deployment strategy needs to be "Recreate", but it is already set like that. I have tried scaling down and up again multiple times. I have also tried to manually stop the pod deployment and start a new one, but it keeps giving the same errors.
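For reference, the scale-down/scale-up cycle can be done from the CLI; the deployment config name mysql here is an assumption, so adjust to yours:
oc scale dc/mysql --replicas=0
oc get pods -w    # wait until the old pod is completely gone before scaling back up
oc scale dc/mysql --replicas=1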
Any suggestions?
After contacting support, they fixed the volume and recently also updated the platform to prevent this bug in the future.

JMeter: Getting "java.net.ConnectException: Connection timed out: connect" error

I am trying to hit 350 users, but JMeter is failing the script with "Connection timed out".
I have added the following:
http.connection.stalecheck$Boolean=true in hc.parameter file
httpclient4.retrycount=1
hc.parameter.file=hc.parameter
Is there anything else that I need to add?
This normally indicates a problem on the application under test side, so I would recommend checking your application's logs for anything suspicious.
If everything seems to be fine there, check the logs of your web and database servers. For instance, Apache HTTP Server allows 150 concurrent connections by default, MySQL 100, etc., so you may need to identify whether you are running into this kind of limit and what needs to be done to raise it.
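As an illustration, raising the worker limit for Apache 2.4's event MPM might look like this (the file location varies by distribution, and 400 is just an arbitrary value above the 350-user target):
<IfModule mpm_event_module>
    ServerLimit          16
    MaxRequestWorkers    400
</IfModule>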
And finally, it may simply be a lack of CPU or free RAM on the application under test side, so next time you run your test keep an eye on baseline OS health metrics, as an application may respond slowly or even hang if it doesn't have any spare headroom. You can use the JMeter PerfMon plugin to integrate this form of monitoring with your load test.

OpenShift Pod gets stuck in pending state

A MySQL pod in OpenShift gets stuck after a new deployment and shows the message "The pod has been stuck in the pending state for more than five minutes." What can I do to solve this? I tried to scale the current deployment pod to 0 and scale the previous deployment pod, which was working earlier, to 1, but it also got stuck.
If a pod is stuck in the pending state, we can remove it by executing:
oc delete pod/<name of pod> --grace-period=0
This command removes the pod immediately, but use it with caution because it may leave process PID files behind on persistent volumes.
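Before force-deleting, it's usually worth checking why the pod is pending; the Events section of the describe output typically shows the reason (e.g. volume attach failures like the ones in the questions above):
oc describe pod <name of pod>
oc get events --sort-by='.lastTimestamp'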