OpenShift MySQL failed to attach volume

I have a small Python Flask server running on OpenShift starter us-west-1. I use a MySQL container for data storage. Yesterday I scaled down the MySQL application from 1 to 0 pods. When I tried to scale it back up to 1 pod, the container creation keeps failing when trying to mount the persistent volume:
Failed to attach volume "pvc-8bcc2d2b-8d92-11e7-8d9c-06d5ca59684e" on node "ip-XXX-XXX-XXX-XXX.us-west-1.compute.internal" with: Error attaching EBS volume "vol-08b957e6975554914" to instance "i-05e81383e32bcc8ac": VolumeInUse: vol-08b957e6975554914 is already attached to an instance status code: 400, request id: 3ec76894-f611-445f-8416-2db2b1e9c5b7
I have seen suggestions saying that the deployment strategy needs to be "Recreate", but it is already set that way. I have tried scaling down and up again multiple times. I have also tried manually stopping the pod deployment and starting a new one, but it keeps giving the same errors.
Any suggestions?
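For readers hitting the same VolumeInUse error on a cluster they control, a minimal diagnostic sketch (the pod name is a placeholder; on OpenShift Online starter only support can act on the underlying EBS volume):

oc get pods -o wide                              # shows which node each pod is scheduled on
oc delete pod <old-mysql-pod> --grace-period=0   # force-remove a stuck pod so the volume can detach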

After contacting support, they fixed the volume and recently also updated the platform to prevent this bug in the future.

Related

docker-compose, NodeJS starts before MySQL even when using depends_on

This only occurs when building the project with docker-compose.yml; for some reason, Docker doesn't wait for the port to become active before starting the next service.
My question is: is there any way to do this without using wait-for-it or similar programs?
Edit: I have also tried this, which was unsuccessful.
Thanks in advance!
depends_on only means that Compose will wait for the container to start; it does not mean that the service inside that container is ready.
Look here for an example of how to wait until the DB is ready: https://github.com/api-platform/api-platform/blob/master/api/docker/php/docker-entrypoint.sh#L29
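For illustration, a minimal entrypoint sketch along the lines of that script, assuming a Compose service named db on port 3306 and a Node app (all of these names are placeholders, and nc must be available in the image):

#!/bin/sh
# Block until MySQL accepts TCP connections, then hand over to the app.
until nc -z db 3306; do
  echo "waiting for MySQL..."
  sleep 1
done
exec node server.js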

ECS EC2 Launch Type: Service database connection string

I am trying out a small POC (learning experiment) with Docker. I have 3 Docker images, one each for a storefront, a search engine, and a database engine, called storefront, solr, and docmysql respectively. I have tried running them in a Docker swarm (on a single node) on EC2 and it works fine.
In the POC, I next needed to move this to AWS ECS using the EC2 launch type on a single non-Amazon-ECS-optimized AMI. I have installed and started an ecs-agent on it. I have created 3 services, with one task for each of the 3 images configured as containers within the task. The question is about connecting to the database from the storefront.
The storefront has a property file where the database connection is typically defined as
"jdbc:mysql://docmysql/hybris64?useConfigs=maxPerformance&characterEncoding=utf8&useSSL=false".
This worked when I ran it as a Docker swarm. Once I moved it to ECS (EC2 launch type), I had to expose port 3306 from my task/container for the docmysql service. This gave me a service endpoint of docmysql.local, with 'local' being a private namespace. I tried changing the connection string to
"jdbc:mysql://docmysql.local/hybris64?useConfigs=maxPerformance&characterEncoding=utf8&useSSL=false"
in the property file, and it always fails with "Name or service not known". What should my connection string be? When the service is created I see 2 entries in Route 53, one SRV record and one A record. The A record's name is of the form <task-id>.docmysql.local; if I use this in the database connection string, it works, but that's obviously not the right thing to do with a hardcoded task ID. I have read about AWS Cloud Map (service discovery) but am still not very clear how to go about it. I will not be putting any load balancer in front of my DB task in the service; there will always be only one task for the DB.
So what is the best way to generate a connection string that works? And why did I not have these issues when I ran it as a Docker swarm?
I know I could use RDS instead of running my own database, and I will try that, but for now I need this working as I have started with this. Thanks for any help.
Well, I'd raise some points before getting to my own solution to the problem:
Do you need your instance to scale using ECS? If not, migrate it to RDS.
Do you need to deploy it with the EC2 launch type? If not, use Fargate; it is simpler to manage.
Now, I've faced this issue on Fargate, and discovered that, depending on your container/task definitions, the database can be run inside the same task for testing purposes, in which case 127.0.0.1 is the answer (containers in the same task can talk over localhost).
Across different tasks you need to work with the awsvpc network mode, which gives you this:
Each task that uses the awsvpc network mode receives its own elastic network interface, which is attached to the container instance that hosts it. (FROM AWS)
My suggestion is to create a Lambda Function to discover your network interface dynamically.
Read these for a deeper understanding:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.html
https://aws.amazon.com/blogs/developer/invoking-aws-lambda-functions-from-java/
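As a rough illustration of what such a lookup does, whether from a Lambda or a shell, here is a sketch using the AWS CLI; the cluster name is a placeholder:

# Find the first running task of the docmysql service.
TASK_ARN=$(aws ecs list-tasks --cluster my-cluster --service-name docmysql \
  --query 'taskArns[0]' --output text)
# Extract the ENI attached to that task (awsvpc mode).
ENI_ID=$(aws ecs describe-tasks --cluster my-cluster --tasks "$TASK_ARN" \
  --query "tasks[0].attachments[0].details[?name=='networkInterfaceId'].value" \
  --output text)
# Resolve the ENI to the private IP the storefront can connect to.
aws ec2 describe-network-interfaces --network-interface-ids "$ENI_ID" \
  --query 'NetworkInterfaces[0].PrivateIpAddress' --output text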

RabbitMQ cluster (classic config) split

I am trying to configure a RabbitMQ cluster in the cloud using a config file.
My procedure is this:
Get the list of instances I want to cluster with (via the cloud API, before cluster startup).
Modify the config file like this:
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@host1.my.long.domain.name
cluster_formation.classic_config.nodes.2 = rabbit@host2.my.long.domain.name
...
Run rabbitmq-server
I expect all nodes to form a single cluster, but instead I sometimes end up with 2+ independent clusters. How do I solve this issue?
UPDATE:
I found out that when I run rabbitmqctl join_cluster rabbit@host.in.existing.cluster on a node that is already in some cluster, this node leaves its previous cluster (I expected the clusters to merge). That might be the root of the problem.
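For reference, join_cluster is normally run against a stopped (and usually reset) node, which is why it replaces cluster membership rather than merging clusters; a minimal sketch (note that reset wipes this node's local data):

rabbitmqctl stop_app                                      # stop the RabbitMQ application, keep the Erlang VM
rabbitmqctl reset                                         # clear local node state (destroys this node's data)
rabbitmqctl join_cluster rabbit@host.in.existing.cluster
rabbitmqctl start_app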
UPDATE 2:
I have 4 instances. Three run bare rabbitmq-server, and one is configured to join the other three. When started, it joins the last instance in its config; the two others show no activity in their logs. This happens with both the classic config and the Erlang config.
When you initially start up your cluster, there is no mechanism to resolve the race condition. Using peer discovery backends will not help with this issue (tested on etcd).
What actually resolved this issue was not deploying the instances simultaneously. When they are started one by one, everything is fine and you get one stable cluster, which can then handle scaling without failure.
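A sketch of that serialized startup, assuming SSH access to the hosts (hostnames are placeholders; rabbitmqctl await_startup is available on modern RabbitMQ releases):

for host in host1 host2 host3; do
  ssh "$host" 'systemctl start rabbitmq-server && rabbitmqctl await_startup'
done
# Verify that all nodes ended up in one cluster:
ssh host1 'rabbitmqctl cluster_status'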

OpenShift Online Next Gen "Unable to mount volumes for pod"

I'm trying to use a regular EBS persistent storage volume in OpenShift Online Next Gen, and getting the following error when attempting to deploy:
Unable to mount volumes for pod "production-5-vpxpw_instanttabletop(d784f054-a66b-11e7-a41e-0ab8769191d3)": timeout expired waiting for volumes to attach/mount for pod "instanttabletop"/"production-5-vpxpw". list of unattached/unmounted volumes=[volume-mondv]
Followed (after a while) by multiple instances of:
Failed to attach volume "pvc-702876a2-a663-11e7-8348-0a69cdf75e6f" on node "ip-172-31-61-152.us-west-2.compute.internal" with: Error attaching EBS volume "vol-0fb5515c87914b844" to instance "i-08d3313801027fbc3": VolumeInUse: vol-0fb5515c87914b844 is already attached to an instance status code: 400, request id: 54dd24cc-6ab0-434d-85c3-f0f063e73099
The log for the deploy pod looks like this after it all times out:
--> Scaling production-5 to 1
--> Waiting up to 10m0s for pods in rc production-5 to become ready
W1001 05:53:28.496345 1 reflector.go:323] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:509: watch of *api.Pod ended with: too old resource version: 1455045195 (1455062250)
error: update acceptor rejected production-5: pods for rc "production-5" took longer than 600 seconds to become ready
I thought at first that this might be related to this issue, but the only running pods are the deploy and the one that's trying to start, and I've switched to a Recreate strategy as suggested there, with no results.
Things did deploy and run normally the very first time, but since then I haven't been able to get it to deploy successfully.
Can anyone shed a little light on what I'm doing wrong here?
Update #1:
As an extra wrinkle, sometimes when I deploy it's taking what seems to be a long time to spin up deploy pods for this (I don't actually know how long it should take, but I get a warning suggesting things are going slowly, and my current deploy is sitting at 15+ minutes so far without having stood up).
In the deploy pod's event list, I'm seeing multiple instances each of "Error syncing pod" and "Pod sandbox changed, it will be killed and re-created." as I wait, having touched nothing.
Doesn't happen every time, and I haven't discerned a pattern.
Not sure if this is even related, but seemed worth mentioning.
Update #2:
I tried deploying again this morning, and after canceling one deploy which was experiencing the issue described in my first update above, things stood up successfully.
I made no changes as far as I'm aware, so I'm baffled as to what the issue is or was here. I'll make a further update as to whether or not the issue recurs.
Update #3:
After a bunch of further experimentation, I seem to be able to get my pod up and running regularly now. I didn't change anything about the configuration, so I assume this is something to do with sequencing, but even now it's not without some irregularities:
If I start a deploy, the existing running pod hangs indefinitely in the terminating state according to the console, and will stay that way until it's hard deleted (without waiting for it to close gracefully). Until that happens, it'll continue to produce the error described above (as you'd expect).
Frankly, this doesn't make sense to me, compared to the issues I was having last night - I had no other pods running when I was getting these errors before - but at least it's progress in some form.
I'm having some other issues once my server is actually up and running (requests not making it to the server, and issues trying to upgrade to a websocket connection), but those are almost certainly separate, so I'll save them for another question unless someone tells me they're actually related.
Update #4:
OpenShift's ongoing issue listing hasn't changed, but things seem to be loading correctly now, so marking this as solved and moving on to other things.
For posterity, changing from Rolling to Recreate is key here, and even then you may need to manually kill the old pod if it gets stuck trying to shut down gracefully.
You cannot use a persistent volume in OpenShift Online with an application which has the deployment strategy set as 'Rolling'. Edit the deployment configuration and make sure the deployment strategy is set to 'Recreate'.
You state you used 'replace'. If you set it to that by editing the JSON/YAML of the deployment configuration, the value change would have been discarded as 'replace' isn't a valid option.
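If you prefer the command line to editing the YAML, a sketch using oc patch (the DeploymentConfig name production is inferred from the rc name in the logs above):

oc patch dc/production -p '{"spec":{"strategy":{"type":"Recreate"}}}'
oc get dc/production -o jsonpath='{.spec.strategy.type}'   # confirm it now reads Recreate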
The error is clearly indicating that the volume is already attached to some other running instance.
VolumeInUse: vol-0fb5515c87914b844 is already attached to an instance status code: 400, request id: 54dd24cc-6ab0-434d-85c3-f0f063e73099
You must clean up by either:
1) Detaching the volume from the running instance and reattaching it. Be careful with the data, because the EBS volume's lifecycle is tied to the pod lifecycle.
2) Making sure, before creating another deployment for a new build, that the earlier running container instance is killed (by deleting the container instance).
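For option 1, on a cluster where you have AWS access (not possible on OpenShift Online), the detach step might look like this with the AWS CLI, reusing the volume ID from the error above:

# See where the volume is currently attached.
aws ec2 describe-volumes --volume-ids vol-0fb5515c87914b844 \
  --query 'Volumes[0].Attachments'
# Detach it; add --force only as a last resort, since forcing can corrupt data.
aws ec2 detach-volume --volume-id vol-0fb5515c87914b844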

OpenShift Pod gets stuck in pending state

A MySQL pod in OpenShift gets stuck after a new deployment and shows the message "The pod has been stuck in the pending state for more than five minutes." What can I do to solve this? I tried to scale the current deployment's pod to 0 and scale the previous deployment's pod to 1, but that one also got stuck, even though it was working earlier.
If a pod is stuck in the pending state, we can remove it by executing
oc delete pod/<name of pod> --grace-period=0
This command removes the pod immediately, but use it with caution because it may leave stale pid files behind on persistent volumes.
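Before force-deleting, it is worth checking why the pod is pending in the first place; a quick sketch (the pod name is a placeholder):

oc describe pod <name of pod>   # the Events section shows scheduling and volume-mount failures
oc get events                   # project-wide events, useful if the pod was already replaced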