How to automatically remove instances in EB whose status is OutOfService? - amazon-elastic-beanstalk

My current configuration requires a minimum of 6 healthy instances. If there are fewer than 6 healthy instances, another instance is added until 6 healthy ones are available again. The result is that my load balancer lists 8 instances: 6 healthy and 2 OutOfService. How can I terminate these two automatically?

Make sure that your Elastic Beanstalk configuration is set to use the ELB for health checking. If you do this, the Auto Scaling group (ASG) will cycle out instances that go OutOfService in the ELB.
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.managing.elb.html
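If you manage this through .ebextensions, a minimal sketch of switching the Auto Scaling group to ELB health checks could look like the following (AWSEBAutoScalingGroup is the resource name Elastic Beanstalk creates; the grace period value is illustrative):

Resources:
  AWSEBAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      HealthCheckType: ELB              # terminate and replace instances the ELB marks OutOfService
      HealthCheckGracePeriod: 300       # seconds to let a new instance boot before ELB results count

With ELB health checks in place, instances the load balancer marks OutOfService are terminated and replaced by the ASG instead of being left in the group.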

Related

Elastic Beanstalk checking health too soon after launching instance

I'm having problems with the configuration of an Elastic Beanstalk environment. Almost immediately, within maybe 20 seconds of launching a new instance, it starts showing a warning status and reporting that the health checks are failing with 500 errors.
I don't want it to even attempt to do a health check on the instance until it's been running for at least a couple of minutes. It's a Spring Boot application and needs more time to start.
I have an .ebextensions/autoscaling.config declared like so...
Resources:
  AWSEBAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: 200
      DefaultInstanceWarmup: 200
      NewInstancesProtectedFromScaleIn: false
      TerminationPolicies:
        - OldestInstance
I thought the HealthCheckGracePeriod should do what I need, but it doesn't seem to help. EB immediately starts trying to get a healthy response from the instance.
Is there something else I need to do, to get EB to back off and leave the instance alone for a while until it's ready?
The HealthCheckGracePeriod is the correct approach. The service will not be considered unhealthy during the grace period. However, this does not stop the ELB from sending health checks at the defined (or default) health check interval. So you will still see failing health checks, but they won't cause the service to be considered "unhealthy".
There is no setting to prevent the healthcheck requests from being sent at all during an initial period, but there should be no harm in the checks failing during the grace period.
You can make HealthCheckIntervalSeconds longer, but that interval applies to every health check, not just the ones during startup.
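If you do decide to lengthen it, a sketch for a classic load balancer environment (namespace aws:elb:healthcheck; the values are illustrative, and an Application Load Balancer uses a different namespace) would be:

option_settings:
  aws:elb:healthcheck:
    Interval: 60            # seconds between health check requests
    Timeout: 5              # seconds to wait for a response
    HealthyThreshold: 3     # consecutive successes before marking InService
    UnhealthyThreshold: 5   # consecutive failures before marking OutOfService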

Openshift 3.11 atomic-openshift-node - Service Restart Impact

We have a plan to scale up the max pods per node from 200 to 350.
Based on the documentation, in order for the new node config to take effect, the atomic-openshift-node service needs to be restarted.
The cluster where this node is located serves business-critical DCs, pods, services, routes, etc.
The question is: what is the possible operational impact, if any, during the restart of the atomic-openshift-node service? Or is there no direct impact on the applications at all?
Ref: https://docs.openshift.com/container-platform/3.3/install_config/master_node_configuration.html
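For context, the max-pods change itself is a kubelet argument in the node configuration, roughly like this (a sketch; whether it lives in /etc/origin/node/node-config.yaml or in the 3.11 node configmaps depends on the install):

kubeletArguments:
  max-pods:
  - "350"   # raise the per-node pod limit from the current 200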
Containers should keep running without interruption. However, in some cases the node service (node controller) can report a "NotReady" status for some pods. I don't know the exact cause; I suspect a race condition, probably depending on the timing of the restart and the readiness probe parameters, and maybe on other node performance conditions.
This may result in the service being unavailable for a while if the pod is removed from the router backends.
Most likely, as long as only one node is changed at a time and the important applications are scaled with HA rules in mind, there should be no business impact.
However, when the node configmap is changed (in 3.11, a new design introduced long after the original question), it is possible that many node services are restarted in parallel (in fact, it does not happen immediately, but still within a short period), which I consider a problematic consequence of the node configmap concept (one configmap for all app nodes).
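If you want to be extra cautious, you can apply the change strictly one node at a time and move workloads off each node before restarting it; a rough sketch using standard OpenShift 3.x tooling (the node name is a placeholder):

oc adm drain app-node-1 --ignore-daemonsets --delete-local-data   # evacuate pods from the node
ssh app-node-1 'systemctl restart atomic-openshift-node'          # pick up the new node config
oc adm uncordon app-node-1                                        # make the node schedulable again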

Openshift Online issue: pod with persistent volume failed scheduling

I have a small web app that ran fine on OpenShift Online for 9 months. It consists of a Python service and a PostgreSQL database (with, of course, a persistent volume).
All of a sudden, last Tuesday, the PostgreSQL pod stopped working, so I tried to redeploy the service. For almost 2 days now, pod scheduling has constantly failed. I have the following entry in the events log:
Failed Scheduling 0/110 nodes are available: 1 node(s) had disk pressure, 5 node(s) had taints that the pod didn't tolerate, 6 node(s) didn't match node selector, 98 node(s) exceed max volume count.
37 times in the last 13 minutes
So it looks like a "disk full" issue at RH's datacenters, which should be easy to fix, but I don't see any notification of it on the status page (https://status.starter.openshift.com/).
My problem looks a lot like the one described for starter-us-west-1:
Investigating - Currently Openshift SRE team trying to resolve this incident. There are high chances that you will face difficulties having pods with attached volumes scheduled.
We're sorry for the inconvenience.
Yet I'm on starter-ca-central-1, which should not be affected. Since it's been such a long time, I'm wondering if anyone at RH is aware of the issue, but I cannot find a way for users on a starter plan to contact them.
Has anybody faced the same issue on ca-central-1?
As mentioned by Graham in the comment, https://help.openshift.com/forms/community-contact.html is the way to go
A few hours (12, actually) after posting my issue through this link, I got feedback from someone at RH who said that my request had been taken into account.
This morning, my app is up at last, and the trouble notice is on the status page:
Investigating - Currently Openshift SRE team trying to resolve this incident. There are high chances that you will face difficulties having pods with attached volumes scheduled.
We're sorry for the inconvenience.
Not sure what would have happened if I hadn't contacted them...
After at least 4 months of working normally, my app running on Starter US West 1 suddenly started getting the following error message during deployment:
0/106 nodes are available: 1 node(s) had disk pressure, 29 node(s) exceed max volume count, 3 node(s) were unschedulable, 4 node(s) had taints that the pod didn't tolerate, 6 node(s) didn't match node selector, 63 Insufficient cpu.
Nothing had changed in the settings before the failures started. I've realized that the problem only occurs on deployments with a persistent volume, like PostgreSQL Persistent in my case.
I have just submitted this issue via the above-mentioned URL. When I get a response or a solution, I'll post it here.

Google Cloud auto-scaling thrashes between 0 and 1, even with minimum of 1

I have a managed instance group with autoscaling enabled: a minimum of 1 and a maximum of 10 instances, with health checks and a CPU target of 0.8.
The number of instances continually switches between 0 and 1, every few minutes. I am unable to find the reason GCP decides to remove an instance and then immediately add it back. The health checks have no logs anywhere.
More concerning is that the minimum instances required is violated.
Thoughts? Thanks!
Edit: This may be due to instances becoming unhealthy, most likely because a firewall rule was needed to allow health checks on the instances. The health check worked for load balancing, but apparently not for instance health. I'm using a custom network, so I needed to add the firewall rule.
https://cloud.google.com/compute/docs/load-balancing/health-checks#configure_a_firewall_rule_to_allow_health_checking
Will confirm/update after some monitoring time.
Don't confuse two different features: the autohealer and the autoscaler of managed instance groups.
The --min-num-replicas flag is a parameter of the autoscaler; setting it ensures that the autoscaler's target number of instances is never set below that threshold. However, autohealing works on its own and does not follow the autoscaler's configuration.
Therefore, when instances in a managed group fail their health checks and autohealing is enabled, they are considered dead and removed from the pool without the minimum number of replicas being taken into account.
It is always best practice to verify that health checks are working properly in order to avoid this kind of misbehaviour. The common issues are:
Firewall rules
Wrong protocols/ports
Server not starting automatically when the machine boots
Note also that if the health checks are more complex and interact with software running on the instance, you need to be sure the instance has started by configuring the initial delay accordingly, i.e. the length of the period during which the instance is known to be initializing and should not be autohealed even if it is unhealthy.
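If a missing firewall rule is indeed the problem, a rule along these lines lets Google's documented health-check source ranges reach the instances (the network name and port are illustrative):

# 130.211.0.0/22 and 35.191.0.0/16 are Google's health-check source ranges
gcloud compute firewall-rules create allow-health-checks \
    --network my-custom-network \
    --allow tcp:80 \
    --source-ranges 130.211.0.0/22,35.191.0.0/16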

Queue data publish/duplicate

I am using IBM WebSphere MQ 7.5 as the queue manager for my applications.
I am already receiving data through a single queue.
On the other side, there are 3 applications that want to process that data.
I see 3 possible solutions for duplicating/distributing the data between them:
1. Use a broker to duplicate the one queue into 3 queues - I don't have a broker, so this is not an option for me.
2. Write an application that gets messages from the input queue and puts them onto 3 other queues on the same machine.
3. Define publish/subscribe definitions so that the input is published to 3 queues on the same machine.
I want to know which of methods 2 and 3 is preferred, i.e. which has higher performance and acceptable operational management effort.
Based on the description, I would say that going pub/sub would achieve the goal; try to think in pure pub/sub terms rather than thinking about the queues, i.e. you have an application that publishes to a topic, and 3 applications that each have their own subscription to get copies of the messages.
You then have the flexibility to define durable or non-durable subscriptions, for example.
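As a rough MQSC sketch of option 3 (all object names and the topic string are illustrative, not taken from the question): the existing producer keeps putting to what looks like a queue via a queue alias that resolves to a topic, and each consuming application gets its own administrative subscription delivering into its own queue.

* Topic that carries the incoming data
DEFINE TOPIC(DATA.TOPIC) TOPICSTR('data/input')
* Alias so the existing producer can keep putting to a "queue"
DEFINE QALIAS(INPUT.DATA) TARGET(DATA.TOPIC) TARGTYPE(TOPIC)
* One delivery queue and one administrative subscription per consumer
DEFINE QLOCAL(APP1.IN)
DEFINE QLOCAL(APP2.IN)
DEFINE QLOCAL(APP3.IN)
DEFINE SUB(APP1.SUB) TOPICSTR('data/input') DEST(APP1.IN)
DEFINE SUB(APP2.SUB) TOPICSTR('data/input') DEST(APP2.IN)
DEFINE SUB(APP3.SUB) TOPICSTR('data/input') DEST(APP3.IN)

Each subscription receives its own copy of every publication, which gives the 1-to-3 fan-out, and because administrative subscriptions are durable, the queue manager keeps delivering even while a consumer is down.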
For option # 2, there are (at least) 2 solutions available:
There is an open source application called MMX (Message Multiplexer). It will do exactly what you describe. The only issue is that you will need to manage the application yourself, i.e. if you stop the queue manager, the application will need to be manually restarted.
There is a commercial solution called MQ Message Replication. It is an API Exit that runs within the queue manager and does exactly what you want. Note: There is nothing external to manage as it runs within the queue manager.
I think there is another solution, using MQ only: define a namelist that mirrors queue1 to queue2 and queue3.
It should be defined with a source, a destination, and the queue manager.
Hope it is useful.
Biruk.