Elastic Beanstalk randomly goes offline - amazon-elastic-beanstalk

My elastic beanstalk environment keeps switching the environment health from Warning to OK from OK to Warning almost every 15 minutes or so.
The Scale settings is Min: 2 max: 4 and there is 4 EC2 instances running.
Can somebody please help?
Screenshot

I can't tell from your screenshot if new instances were created, so it can be wither the scaling trigger being met, or the load balancing one.
Your load balancer trigger is configured here:
Configuration => Load Balancer => EC2 Instance Health Check
the default value is to perform an HTTP request to the root path (/), but you can configure it as you see fit. In most cases degradation happens because your app response time isn't fast enough. You can change the parameters, or fix the app.
As for the scaling trigger, it's configured here:
Configuration => Scaling => Scaling Trigger
By default this is set to NetworkOut, which is the number of bytes sent out from the node and has nothing to do with your server being overloaded. I'm not sure why beanstalk decided to use this metric by default, but you can change it to CPU utilization or any other metric documented here.

Related

60 Second Timeout on Elastic Beanstalk

I have a single-instance (NO load balancer) Docker container (NO proxy server) that times out at exactly sixty seconds no matter what I do.
Yes, I'm aware of the many seemingly "duplicate" questions. I've been trying to solve this problem for 40+ hours. I've seen them all.
Every single answer to these questions informs the user that they must change the settings of NGINX or the load balancer.
However, I have NEITHER NGINX or a load balancer for the environment, yet it still times out. I am mostly convinced that this is an AWS bug.
I have an endpoint titled time_test for the mini server I created. When I make a POST request to the endpoint, I get a timeout at exactly 60 seconds (the request throws an exception on my end).
Here's the Python code to make the request.
import requests
url = f"http://...us-east-1.elasticbeanstalk.com/"
time_to_sleep = 65
url += f"time_test?time_to_sleep={time_to_sleep}"
response = requests.post(url=url, timeout=10000)
This throws an HTTPSException error, indicating that the server terminated the response, always at exactly 60 seconds.
However, the logs show a successful response.
My logs (specifically, "eb-docker/containers/eb-current-app/eb-blahblah-stdouterr.log) shows
[01/Jun/2022 22:05:49] "POST /time_test?time_to_sleep=65 HTTP/1.1" 200 -
Note the 200 successful status code.
I'm going to continue to find an answer to this problem, which seemingly has none, and will report back if so. Any help with how to change the environment to accept >60 second requests would be greatly appreciated. Please don't reply, "You should have shorter request times." Not helpful or applicable.
(Platform = Docker running on 64bit Amazon Linux 2/3.4.10)
Related:
How to increase FastAPI timeout in Docker to be deployed on AWS EB?
Elastic Beanstalk WebSocket Connection Dropped
PHP beanstalk application giving 504 errors
Blazor Server Side - Frequent 504 errors in AWS environment
504 error on aws elastic beanstalk
Deploying ebextensions on Elastic beanstalk and EC2
AWS bug. It magically started working after I reported this issue to support. No changes. Considering it magically stopped working, that's the conclusion I've come to.

Connectivity issue to database after failover

We have a cluster of Tomcat servers in AWS BeansTalk connected to AWS RDS (MySQL) with Multi-AZ availability.
Some days ago, the RDS instance had a patch applied to the OS which triggered a failover to another RDS instance based on the Multi-AZ availability.
The result was a Production system down during hours (it was at night) until we restarted the Tomcats in each instance. We had thousands of Connection refused errors to database.
According to AWS support, when a failover instance is launched, the endpoint is the same but its IP is changed, and my Tomcats had the old IP cached. So after restarting Tomcat the cache was cleared, the new IP was used and the connectivity issue was resolved. They refer me to this SO question.
That makes a lot of sense however I couldn't reproduce the issue in a controlled test with the same application in Production.
I changed the IP of a domain in /etc/hosts and my current BeansTalk Production Tomcat detected the IP change 30 seconds later, so it should have detected the RDS endpoint IP change too.
The Java ttl property in my BeansTalk environment is set as:
#networkaddress.cache.ttl=-1
So, by default it takes 30 secs as cache, that matches with my experiment.
[EDIT] As suggested in the comments, I've tried to simulate a failover through DNS. In this case, I've changed a CNAME record from a domain to another domain. I did the same test and Tomcat detected the change again 30 seconds later.
Do you have any idea why in this case the RDS endpoint IP change was not detected by Tomcat/JVM?

Some 502 errors in GCP HTTP Load Balancing

Our load balancer is returning 502 errors for some requests. It is just a very low percentage of the total requests, we have around 36000 request per hour and about 40 errors per hour, so just a 0,01% of the requests returns an error.
The instances are healthy when the error occurs and we have added this forwarding rule to the firewall for the load balancer: 130.211.0.0/22 tcp:1-5000 Apply to all targets
It is not a very serious problem because the application tolerates such errors, but I would like to know why they are given.
Any help will be apreciated.
It seems that there are no an easy solution for this.
As Mike Fotinakis explains in this blog (thank you for this info JasonG :)):
It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.
In my case I'm using Apache with the mpm_prefork module. The solution proposed is to increase the connection keepalive timeout to 650s, but this is not possible because each connection opens one new process (so this would represent a great waste of resources).
UPDATE:
It seems that there are some new documentation about this problem on the official load balancer documentation page (search for "Timeouts and retries"): https://cloud.google.com/compute/docs/load-balancing/http/
They recommend to set the KeepAliveTimeout value to 620 in both cases (Apache and Nginx).
I had an issue w/ 502s that was unexplainable after recreating a load balancer and backend config. I recreated my backend & instance group for unmanaged instances and this seemed to fix the issue for me. I wasn't able to identify any issues in my configuration in GCP :(
But I had a lot more errors - 1/10. There are load balancer logs that will tell you what the cause is and docs explain the causes.
Eg mine were:
jsonPayload: { statusDetails: "failed_to_pick_backend" #type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBal‌​ancerLogEntry" }
If you're using nginx and it's on POSTS and the error is reported as "backend_connection_closed_before_data_sent_to_client" it may be fixed by changing your nginx timeouts. See this excellent blog post:
https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340#.btzyusgi6

GCP HTTPS load balancing: when is it safe to delete instance?

TLDR: What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
Details: I have a relatively standard setup: GCE instances in a managed instance group, global HTTPS load balancer in front of them pointed at a backend service with only the one managed instance group in it. Health checks are standard 5 seconds timeout, 5 seconds unhealthy threshold, 2 consecutive failures, 2 consecutive successes.
I deploy some new instances, add them to the instance group, and remove the old ones. After many minutes (10-15 min usually), I delete the old instances.
Every once in a while, I notice that deleting the old instances (which I believe are no longer receiving traffic) correlates with a sporadic 502 response to a client, which can be seen only in the load-balancer level logs:
I've done a bunch of logs correlation and tcpdumping and load testing to be fairly confident that this 502 is not being served by one of the new, healthy instances. In any case, my question is:
What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
I think what you are looking for is the connection draining feature. https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining
To answer my own question: it turns out that these 502s were not related to shutting down an instance, 10 minutes was plenty of time to remove an instance from the serving path. The 502s were caused by a race condition between nginx timeouts and GCP's HTTP(S) Load Balancer timeouts—I've written up a full blog post on it here: Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer

Google Compute Instance 100% CPU Utilisation

I am running n1-standard-1 (1 vCPU, 3.75 GB memory) Compute Instance , In my android app around 80 users are online write now and cpu Utilisation of instance is 99% and my app became less responsive. Kindly suggest me the workaround and If i need to upgrade , can I do that with same instance or new instance needs to be created.
Since your app is running already and users are connecting to it, you don't want to do the following process:
shut down the VM instance, keeping the boot disk and other disks
boot a more powerful instance, using the boot disk from step (1)
attach and mount any additional disks, if applicable
Instead, you might want to do the following:
create an additional VM instance with similar software/configuration
create a load balancer and add both the original and new VM to it as a backend
change your DNS name to point to the load balancer IP instead of the original VM instance
Now, your users will be randomly sent to a VM that's least-loaded to see the application, and you can add more VMs if your traffic increases.
You did not describe your application in detail, so it's unclear if each VM has local state (e.g., runs a database) or there's a database running externally. You will still need to figure out how to manage the stateful systems such as database or user-uploaded data from all the VM instances, which is hard to advise on given the little information in your quest.