Elastic Beanstalk checking health too soon after launching instance - amazon-elastic-beanstalk

I'm having problems with the configuration of an Elastic Beanstalk environment. Almost immediately, within maybe 20 seconds of launching a new instance, it starts showing a warning status and reporting that the health checks are failing with 500 errors.
I don't want it to even attempt to do a health check on the instance until it's been running for at least a couple of minutes. It's a Spring Boot application and needs more time to start.
I have an .ebextensions/autoscaling.config declared like so...
Resources:
  AWSEBAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: 200
      DefaultInstanceWarmup: 200
      NewInstancesProtectedFromScaleIn: false
      TerminationPolicies:
        - OldestInstance
I thought the HealthCheckGracePeriod should do what I need, but it doesn't seem to help. EB immediately starts trying to get a healthy response from the instance.
Is there something else I need to do, to get EB to back off and leave the instance alone for a while until it's ready?

HealthCheckGracePeriod is the correct approach: the instance will not be considered unhealthy during the grace period. However, it does not stop the ELB from sending health checks at the defined (or default) health check interval, so you will still see failing health checks; they just won't cause the environment to be marked unhealthy.
There is no setting that prevents the health check requests from being sent at all during an initial period, but there is no harm in the checks failing during the grace period.
You can make HealthCheckIntervalSeconds longer, but that applies to every health check interval, not just during startup.
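For reference, a minimal sketch of tuning the interval and thresholds from .ebextensions, assuming the environment uses an Application Load Balancer (the namespace and option names differ for a classic ELB, e.g. aws:elb:healthcheck with an Interval option); the /actuator/health path is just a Spring Boot example:
option_settings:
  aws:elasticbeanstalk:environment:process:default:
    HealthCheckPath: /actuator/health   # example Spring Boot health endpoint
    HealthCheckInterval: 30             # seconds between checks (applies always, not only at startup)
    HealthCheckTimeout: 10
    HealthyThresholdCount: 3
    UnhealthyThresholdCount: 5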

Related

Openshift 3.11 atomic-openshift-node - Service Restart Impact

We have a plan to scale up the max pods per node from 200 to 350.
Based on the documentation, in order for the new node config to take effect, the atomic-openshift-node service needs to be restarted.
The cluster where this node is located serves business-critical DCs, pods, services, routes, etc.
The question is: what is the possible operational impact during the restart of the atomic-openshift-node service, if any? Or is there no direct impact to the applications at all?
Ref: https://docs.openshift.com/container-platform/3.3/install_config/master_node_configuration.html
Containers should keep running without interruption, but the node service (node controller) can in some cases report a "NotReady" status for some pods. I don't know the exact cause; I suspect a race condition, probably depending on the timing of the restart and the readiness probe parameters, and maybe on other node performance conditions.
This may result in a service being unavailable for a while if it is removed from the "router" backends.
Most likely, as long as only one node is changed at a time and the important applications are well scaled with HA rules in mind, there should be no business impact.
But in the case of a node configuration ConfigMap change (a new design introduced in 3.11, long after the original question), many node controllers can be restarted in parallel (in fact it does not happen immediately, but still within a short period), which I consider a problematic consequence of the node ConfigMap concept (one ConfigMap for all app nodes).
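If you want to be extra cautious, a rough sketch of handling nodes one at a time could look like the following (an oc 3.x client is assumed, app-node-1 is a placeholder, and the drain step is optional since the containers are expected to keep running through the service restart):
oc adm cordon app-node-1                      # stop new pods from landing on the node
oc adm drain app-node-1 --ignore-daemonsets   # optional: move workloads off first
systemctl restart atomic-openshift-node       # run on the node itself after the config change
oc adm uncordon app-node-1                    # make the node schedulable again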

Postfix queue keeps growing

I've been having some trouble with our mail server since yesterday.
First, the server was down for a couple of days: thanks to KVM, the VMs were paused because storage was apparently full. I managed to fix that issue, but since the mail server came back online, CPU usage has been constantly at 100%. I checked the logs, and there were "millions" of mails waiting in the Postfix queue.
I tried to flush the queue with the PFDel script; it took some time, but all the mails were gone and we were finally able to receive new emails. I also forced a logrotate, because fail2ban was also using a lot of CPU.
Unfortunately, after a couple of hours, the Postfix active queue is growing again, and I really don't understand why.
Another script I found gives me this result right now:
Incoming: 1649
Active: 10760
Deferred: 0
Bounced: 2
Hold: 0
Corrupt: 0
Is there a way to deactivate "Undelivered Mail returned to Sender"?
Any help would be much appreciated.
Many thanks
You could first temporarily stop sending bounce mails completely, or at least set stricter rules, in order to analyze the reason for the flood. See for example: http://domainhostseotool.com/how-to-configure-postfix-to-stop-sending-undelivered-mail-returned-to-sender-emails-thoroughly.html
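As a minimal sketch of the approach described in articles like the one above: point the bounce-related services in /etc/postfix/master.cf at the discard(8) agent so the "Undelivered Mail Returned to Sender" notices are silently dropped (note this suppresses all delivery status notifications, so treat it as a temporary measure):
# /etc/postfix/master.cf -- last field changed from "bounce" to "discard"
bounce    unix  -       -       n       -       0       discard
defer     unix  -       -       n       -       0       discard
trace     unix  -       -       n       -       0       discard
Reload Postfix afterwards (postfix reload).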
Sometimes spammers find a weakness (or even a vulnerability) in your configuration or in the SMTP server and use it to send spam (even if it can only reach the addressee via a bounce). In that case you will usually find your IP/domain in common blacklist services (or it will be blacklisted by the large mail providers very quickly), which adds to the flood: the bounces are rejected by the recipient servers, which makes your queue grow even more.
So also check your IP/domain with https://mxtoolbox.com/blacklists.aspx or a similar service (sometimes they also provide the reason why it was blocked).
As for fail2ban, you can also analyze the logs (look for a pattern) to detect the evildoers (the initial senders) and write a custom regex for fail2ban to ban them, for example after 10 attempts in 20 minutes (or add them to an ignore list for bounce messages in Postfix). You would still send the first X bounces, but after that the repeat-offender IPs would be banned, which could also help reduce the flood significantly.
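A hedged sketch of such a jail (the filter name postfix-flood and the failregex are hypothetical; the regex must be adapted to whatever repeating line you actually see in /var/log/mail.log):
# /etc/fail2ban/filter.d/postfix-flood.conf
[Definition]
failregex = ^.*postfix/smtpd\[\d+\]: NOQUEUE: reject: RCPT from \S+\[<HOST>\].*$

# /etc/fail2ban/jail.local
[postfix-flood]
enabled  = true
port     = smtp,submission
filter   = postfix-flood
logpath  = /var/log/mail.log
maxretry = 10      # ban after 10 attempts...
findtime = 1200    # ...within 20 minutes
bantime  = 3600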
And last but not least, check your configuration (follow the best practices for it) and set up at least proper MX/SPF records, DKIM signing/verification and DMARC policies.

Google Cloud auto-scaling thrashes between 0 and 1, even with minimum of 1

I have a managed instance group with autoscaling enabled: a minimum of 1 and a maximum of 10 instances, with health checks and a CPU target of 0.8.
The number of instances continually switches between 0 and 1, every few minutes. I am unable to find the reason GCP decides to remove an instance and then immediately add it back. The health checks have no logs anywhere.
More concerning is that the minimum instances required is violated.
Thoughts? Thanks!
Edit: This may be due to instances becoming unhealthy, most likely because a firewall rule was needed to allow health checks on the instances. The health check worked for load balancing, but apparently not for instance health. I'm using a custom network, so I needed to add the firewall rule.
https://cloud.google.com/compute/docs/load-balancing/health-checks#configure_a_firewall_rule_to_allow_health_checking
Will confirm/update after some monitoring time.
Don't confuse two different features: the autohealer and the autoscaler of managed instance groups.
The --min-num-replicas flag is a parameter of the autoscaler; by setting it, you ensure that the target number of instances will never be set below a certain threshold. However, autohealing works on its own and does not follow the autoscaler's configuration.
Therefore, when instances belong to a managed group and fail their health checks, they are considered dead and, if autohealing is enabled, removed from the pool without taking the minimum number of replicas into account.
It is always best practice to verify that the health checks are working properly in order to avoid this kind of misbehaviour. The common issues are (a sample firewall rule for the first one is sketched after this list):
Firewall rules
Wrong protocols/ports
Server not starting automatically when the machine powers up
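For the firewall case, a sketch of a rule that lets Google's documented health-check probe ranges reach the instances (the network name and port are placeholders):
gcloud compute firewall-rules create allow-gcp-health-checks \
    --network my-custom-net \
    --allow tcp:80 \
    --source-ranges 130.211.0.0/22,35.191.0.0/16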
Notice also that if the health checks are a bit more complex and interact with some piece of software, you need to be sure the instance has fully started before the checks count against it; configure the initial delay flag accordingly, i.e. the length of the period during which the instance is known to be initializing and should not be autohealed even if it is unhealthy.
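A sketch of setting that initial delay on the group (resource names are placeholders, and the exact subcommand may vary between gcloud versions):
gcloud compute instance-groups managed set-autohealing my-mig \
    --zone us-central1-a \
    --health-check my-health-check \
    --initial-delay 300   # seconds to wait before autohealing judges a new instance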

Netty HttpServer Chrome Browser Multiple Requests

We use Netty, version 4.1.13. We create an HttpServer, HttpServerInitializer and HttpServerHandler and start it on a port. When we make a request from the Chrome browser, HttpServerInitializer is called 3 or 4 times (sometimes 3, sometimes 4), and it is called again after 10 seconds. When we make a request through Microsoft Edge or through the console, it is called once, as expected, and HttpServerHandler handles the rest.
What should we do to prevent HttpServerInitializer from handling unnecessary extra requests? We have session operations attached to the pipeline in the initializer, so this is a critical issue for us.
The default behaviour of browsers with HTTP/1.x is to open several connections (how many depends on the browser) so they can make requests in parallel. That way they can retrieve resources like CSS, JS, images, etc. concurrently.
The number of connections is configurable in the browser. In general there are two preferences: the maximum number of connections per hostname and the total maximum number of open connections.
See also: http://www.browserscope.org/?category=network&v=0
So when you start a request with Chrome, it opens several connections, even if it ends up using only one because there aren't that many requests to make. The idle, unused connections are closed after a few seconds.
I think that's why you see HttpServerInitializer being called several times: simply because there are several connections. Server side this is normal, because you cannot know whether these are different clients or a single client with many connections.
I advise you not to do costly operations on the connection-opened event, but only when you receive a valid message/request. Your initializer should only configure the necessary handlers on the pipeline, which should be quick and simple, and nothing else.
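A minimal sketch of that split for Netty 4.1 (the class names mirror the ones in the question, but the bodies are assumptions): the initializer only wires codec handlers, and anything session-related runs once per request inside the handler, not once per connection.
import io.netty.buffer.Unpooled;
import io.netty.channel.*;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http.*;
import io.netty.util.CharsetUtil;

// Runs once per connection: keep it cheap, wire handlers only.
public class HttpServerInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        ChannelPipeline p = ch.pipeline();
        p.addLast(new HttpServerCodec());
        p.addLast(new HttpObjectAggregator(64 * 1024));
        p.addLast(new HttpServerHandler());   // no session lookups here
    }
}

// Runs once per request: do the session/authentication work here instead.
class HttpServerHandler extends SimpleChannelInboundHandler<FullHttpRequest> {
    @Override
    protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest request) {
        // e.g. resolve the session from a cookie or header of the actual request
        FullHttpResponse response = new DefaultFullHttpResponse(
                HttpVersion.HTTP_1_1, HttpResponseStatus.OK,
                Unpooled.copiedBuffer("OK", CharsetUtil.UTF_8));
        response.headers().set(HttpHeaderNames.CONTENT_LENGTH, response.content().readableBytes());
        ctx.writeAndFlush(response).addListener(ChannelFutureListener.CLOSE);
    }
}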

What does Chrome Network Timings really mean and what does affects each timing length?

I was looking at the Chrome DevTools resource network timings to detect requests that need to be improved. In the linked documentation there's a definition for each timing, but I don't understand which processes behind the scenes affect the length of each period.
Below are 3 different images, and here is my understanding of what's going on; please correct me if I'm wrong.
Stalled: Why are there timings where the request gets stalled for 1.17 s while others take less?
Request Sent: the time our request took to reach the server
TTFB: the time taken until the server responds with the first byte of data
Content Download: The time until the whole response reaches the client
Thanks
Networking is an area where things vary greatly. A lot of different factors come into play here, and they vary between locations and even at the same location with different types of content.
Here is some more detail on the areas where you need more understanding:
Stalled: This depends on what else is going on in the network stack. One request may not be stalled at all, while other requests are stalled because six connections to the same location are already open. There are more reasons for stalling, but the maximum connection limit is an easy way to explain why it may occur.
The stalled state means we just can't send the request right now; it needs to wait for some reason. Generally, this isn't a big deal. If you see it a lot and you are not on the HTTP/2 protocol, then you should look into minimizing the number of resources being pulled from a given location. If you are on HTTP/2, then don't worry too much about this, since it deals with numerous requests differently.
Look around and see how many requests are going to a single domain. You can use the filter box to trim down the view. If you have a lot of requests going off to the same domain, then that is most likely hitting the connection limit. Domain sharding is one method to handle this with HTTP/1.1, but with HTTP/2 it is an anti-pattern and hurts performance.
If you are not hitting the max connection limit, then the problem is more nuanced and needs a more hands-on debugging approach to figure out what is going on.
Request sent: This is not the time to reach the server; that is the Time To First Byte. All "request sent" means is that the request was sent and it took the network stack X time to carry that out.
There is nothing you can do to speed this up; it is mostly for informational and internal debugging purposes.
Time to First Byte (TTFB): This is the total time for the sent request to get to the destination, then for the destination to process the request, and finally for the response to traverse the networks back to the client.
A high TTFB reveals one of two issues. The first is a bad network connection between the client and the server, so data is slow to reach the server and get back. The second is a server that is slow to process the request, either because the hardware is weak or because the application running on it is slow. Or both problems can exist at once.
To address a high TTFB, first cut out as much of the network as possible. Ideally, host the application locally on a low-resource virtual machine and see if there is still a big TTFB. If there is, then the application needs to be optimized for response speed. If the TTFB is very low locally, then the networks between your client and the server are the problem. There are various ways to handle this that I won't get into, since it is an area of expertise unto itself. Research network optimization, and even try moving hosts to see whether your server provider's network is the issue.
Remember that the entire server stack comes into play here. If nginx or Apache is configured poorly, or your database is taking a long time to respond, or your cache is having trouble, these can all cause delays. They are also difficult to detect locally, since your local server may differ in configuration from the remote stack.
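If you want to sanity-check these numbers outside the browser, curl's --write-out timing variables give a quick breakdown (the URL is a placeholder):
curl -o /dev/null -s -w "dns: %{time_namelookup}s  connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n" https://example.com/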
Content Download: This is the total time, from the first byte arriving, for the client to get the rest of the content from the server. This should be short unless you are downloading a large file. Look at the size of the file and the conditions of the network, and then judge roughly how long the download should take.