Some 502 errors in GCP HTTP Load Balancing - google-compute-engine

Our load balancer is returning 502 errors for some requests. It is a very low percentage of the total: we have around 36,000 requests per hour and about 40 errors per hour, so only about 0.1% of requests return an error.
The instances are healthy when the error occurs, and we have added this firewall rule for the load balancer: 130.211.0.0/22 tcp:1-5000 Apply to all targets
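(For reference, a rule like that can be created with gcloud; the rule name below is illustrative.)
gcloud compute firewall-rules create allow-glb-health-checks \
    --source-ranges 130.211.0.0/22 --allow tcp:1-5000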
It is not a very serious problem because the application tolerates such errors, but I would like to know why they occur.
Any help will be appreciated.

It seems that there is no easy solution for this.
As Mike Fotinakis explains in this blog (thank you for this info JasonG :)):
It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.
In my case I'm using Apache with the mpm_prefork module. The proposed solution is to increase the connection keep-alive timeout to 650s, but this is not possible because with mpm_prefork each open connection keeps a whole process occupied (so this would be a great waste of resources).
UPDATE:
It seems that there is now some documentation about this problem on the official load balancer documentation page (search for "Timeouts and retries"): https://cloud.google.com/compute/docs/load-balancing/http/
They recommend setting the KeepAliveTimeout value to 620 seconds in both cases (Apache and NGINX).
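For Apache that is a one-line change in the main server config; a minimal sketch (the file location varies by distribution):
# /etc/apache2/apache2.conf (Debian/Ubuntu) or httpd.conf (RHEL/CentOS):
# keep idle connections open longer than the load balancer's 600s idle timeout.
KeepAlive On
KeepAliveTimeout 620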

I had an issue with 502s that was unexplainable after recreating a load balancer and backend config. I recreated my backend and instance group for unmanaged instances, and this seemed to fix the issue for me. I wasn't able to identify any issues in my GCP configuration :(
But I had a lot more errors, about 1 in 10 requests. There are load balancer logs that will tell you what the cause is, and the docs explain the causes.
Eg mine were:
jsonPayload: { statusDetails: "failed_to_pick_backend" #type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry" }
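If you want to pull these log entries yourself, something like this works (a sketch; the filter and limit are illustrative):
gcloud logging read 'resource.type="http_load_balancer"' --limit 10 --format json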
If you're using NGINX and it happens on POSTs, and the error is reported as "backend_connection_closed_before_data_sent_to_client", it may be fixed by changing your NGINX timeouts. See this excellent blog post:
https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340#.btzyusgi6
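The fix from that post boils down to raising NGINX's keep-alive timeout above the load balancer's idle timeout, roughly like this (a sketch based on the post; adjust to your setup):
# /etc/nginx/nginx.conf, inside the http block:
# keep idle connections open longer than the GCP load balancer does.
keepalive_timeout 650;
keepalive_requests 10000;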

60 Second Timeout on Elastic Beanstalk

I have a single-instance (NO load balancer) Docker container (NO proxy server) that times out at exactly sixty seconds no matter what I do.
Yes, I'm aware of the many seemingly "duplicate" questions. I've been trying to solve this problem for 40+ hours. I've seen them all.
Every single answer to these questions informs the user that they must change the settings of NGINX or the load balancer.
However, I have NEITHER NGINX nor a load balancer for the environment, yet it still times out. I am mostly convinced that this is an AWS bug.
I have an endpoint named time_test on the mini server I created. When I make a POST request to that endpoint, I get a timeout at exactly 60 seconds (the request throws an exception on my end).
Here's the Python code to make the request.
import requests

# Host elided; the endpoint lives on the Elastic Beanstalk environment.
url = "http://...us-east-1.elasticbeanstalk.com/"
time_to_sleep = 65  # ask the server to sleep past the 60-second mark
url += f"time_test?time_to_sleep={time_to_sleep}"

# The client-side timeout is deliberately huge; the 60-second cutoff
# observed below happens on the server side, not here.
response = requests.post(url=url, timeout=10000)
This raises an HTTPSException, indicating that the server terminated the response, always at exactly 60 seconds.
However, the logs show a successful response.
My logs (specifically "eb-docker/containers/eb-current-app/eb-blahblah-stdouterr.log") show:
[01/Jun/2022 22:05:49] "POST /time_test?time_to_sleep=65 HTTP/1.1" 200 -
Note the 200 successful status code.
I'm going to keep looking for an answer to this problem, which seemingly has none, and will report back if I find one. Any help with how to change the environment to accept requests longer than 60 seconds would be greatly appreciated. Please don't reply "You should have shorter request times." That is neither helpful nor applicable.
(Platform = Docker running on 64bit Amazon Linux 2/3.4.10)
Related:
How to increase FastAPI timeout in Docker to be deployed on AWS EB?
Elastic Beanstalk WebSocket Connection Dropped
PHP beanstalk application giving 504 errors
Blazor Server Side - Frequent 504 errors in AWS environment
504 error on aws elastic beanstalk
Deploying ebextensions on Elastic beanstalk and EC2
AWS bug. It magically started working after I reported this issue to support, with no changes on my end. Considering it also magically stopped working in the first place, that's the conclusion I've come to.

Connect Timeout Error on CloudHub: Mule version 4.2.2

I am trying to hit an HTTPS client API which works fine in Postman (responds in about 800 ms) and in my local Mule flow, but it is not working on CloudHub. I am getting a Connect Timeout error. It tries connecting for 30 seconds (as per the logs) and then gives an HTTP:CONNECTIVITY error.
failed: Connect timeout.
errorType=HTTP:CONNECTIVITY
cause=org.mule.extension.http.api.error.HttpRequestFailedException
The Response Timeout I have set is 5 minutes.
The flow was working fine when deployed on CloudHub before. It stopped working a few days ago, though I didn't make any changes to my code. I am unable to debug this issue because it is not reproducible in my local environment (it works perfectly there). Any help would be appreciated.
Mule HTTP calls offer four different types of general timeouts, each with its own behavior:
Connection Idle Timeout
Response Timeout
Max Idle Timeout
Query or Transaction Timeout (applies to DB connectors)
Since you are getting an HTTP:CONNECTIVITY error, applying a 5-minute Response Timeout doesn't help. The Response Timeout (the request taking a long time to respond) only matters after a connection handshake has been established. Your problem is with the connection itself.
The only possible way you could try fixing this is by applying a Connection Idle Timeout and a Reconnection Strategy with some frequency gaps, as in the sketch below.
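A rough sketch of what that could look like on the requester's connection (the host, port, and values are illustrative, not taken from your app):
<!-- Hypothetical HTTP requester config: an idle timeout plus a reconnection
     strategy that retries the connection a few times with a frequency gap. -->
<http:request-config name="HTTP_Request_config">
  <http:request-connection protocol="HTTPS" host="api.example.com" port="443"
                           connectionIdleTimeout="60000">
    <reconnection>
      <reconnect frequency="3000" count="3"/>
    </reconnection>
  </http:request-connection>
</http:request-config>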
Since you are so sure about your local tests, I suggest the following two steps:
1. Use the same HTTP connector configuration in a separate new Mule app, with a simple listener and the failing requester. Also add a call to a freely available online REST service in an extra flow. Test both and see which one works and which one fails. This will tell you whether it is a real HTTP connectivity problem or something else, such as a Mule bug.
2. Check your configuration once again and make sure you are hitting the same endpoint in the CloudHub version.
Finally, I hope you did not accidentally leave any proxy configuration in the local version.
If it was working before, there was probably a networking change on the other side that prevents access from the CloudHub application. You didn't share the URL, so it is not clear whether it is an internal host or a public host. We also don't know if there is some kind of whitelisting on the server side.
You can test connectivity to the HTTP host and port using the Network Tools application, to see if it is accessible from your CloudHub environment.
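If you prefer to script such a check, a plain TCP probe is enough to separate connection problems from slow responses (host and port below are placeholders):
import socket

# Placeholder endpoint; substitute the host and port your requester targets.
host, port = "api.example.com", 443

try:
    # Mirror the ~30-second connect window seen in the CloudHub logs.
    with socket.create_connection((host, port), timeout=30):
        print(f"TCP connection to {host}:{port} succeeded")
except OSError as exc:
    print(f"TCP connection to {host}:{port} failed: {exc}")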

GCP HTTPS load balancing: when is it safe to delete instance?

TLDR: What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
Details: I have a relatively standard setup: GCE instances in a managed instance group, and a global HTTPS load balancer in front of them pointed at a backend service with only that one managed instance group in it. Health checks are the standard ones: 5-second timeout and check interval, 2 consecutive failures to mark an instance unhealthy, 2 consecutive successes to mark it healthy.
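(For reference, expressed with gcloud that health check looks roughly like this; the check name is illustrative.)
gcloud compute health-checks create http my-health-check \
    --check-interval 5s --timeout 5s \
    --unhealthy-threshold 2 --healthy-threshold 2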
I deploy some new instances, add them to the instance group, and remove the old ones. After many minutes (10-15 min usually), I delete the old instances.
Every once in a while, I notice that deleting the old instances (which I believe are no longer receiving traffic) correlates with a sporadic 502 response to a client, visible only in the load-balancer-level logs.
I've done a bunch of log correlation, tcpdumping, and load testing to be fairly confident that this 502 is not being served by one of the new, healthy instances. In any case, my question is:
What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
I think what you are looking for is the connection draining feature. https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining
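A hedged example of enabling it on a backend service ("my-backend-service" is a placeholder), so that instances being removed can finish in-flight requests first:
gcloud compute backend-services update my-backend-service \
    --global --connection-draining-timeout 300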
To answer my own question: it turns out that these 502s were not related to shutting down an instance; 10 minutes was plenty of time to remove an instance from the serving path. The 502s were caused by a race condition between NGINX timeouts and GCP's HTTP(S) Load Balancer timeouts. I've written up a full blog post on it here: Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer

Google Cloud Platform Target Pool HTTPS Health Check

I am attempting to use purely HTTPS with my Compute Engine instance. I have a network load balancer created that forwards to a pool with my instance in it. However, the pool has constantly failing health checks because it won't let me configure a health check that uses HTTPS.
I'm using Apache to redirect 80 to 443. Does anyone know how to either create an HTTPS health check or have the HTTP health check follow the redirect?
Thanks for any help.
--edit--
I finally came across some documentation at http://googlecloudplatform.blogspot.com/2015/07/Debugging-Health-Checks-in-Load-Balancing-on-Google-Compute-Engine.html.
Failure 5: Not answering directly with a 200 response code. The web server may be configured to redirect to a page that returns an HTTP 200 response code. The health check will not follow the redirect; it expects the health check page to return a 200 directly.
This basic capability has been supported at every other hosting provider we've been on. Why can't this be done? What am I missing?
I spent the whole day trying to configure a purely HTTPS-based load balancer in GCloud for a Kubernetes cluster with an ingress controller.
I finally got it working, so let me share my experience with people who struggle with the same configuration. If the health check fails for the instances, you will usually see the following when accessing your website's URL:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
1) Protocol: GCloud introduced new health checks which can be configured for TCP, SSL, HTTP, HTTPS, or HTTP/2 probing. This can solve the original problem by avoiding the redirect from port 80 to port 443.
2) Path: The most common issue is that the "/" path of your application does not return a 200 OK, which makes the health check fail. This can be prevented by adding a path argument to your health check, e.g. "/index" (see the gcloud sketch at the end of this answer).
3) Ingress HTTPS: This is relatively simple. Adding a secret or a pre-shared-cert to your ingress.yaml will automatically result in an HTTPS load balancer instead of HTTP. Further information can be found here.
Lastly, see the guide from the docs for Setting up HTTP Load Balancing with Ingress.
However, even though the new HTTPS health checks seem to work, they are still in beta and bugs are being reported in the issue tracker. The documentation for the gcloud ingress controller can be found here.
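(Putting 1) and 2) together, creating such a health check with gcloud could look roughly like this; the name and request path are illustrative.)
gcloud compute health-checks create https my-https-check \
    --port 443 --request-path /healthz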

Compute Engine HTTP Load Balancing 502 error

We're having significant issues with our HTTP load balancer, from certain IPs only.
I've seen a few other posts here about this. We've made sure the firewall is OK, and I've even deleted and recreated the forwarding rules, which is blasted annoying since the IP changes each time.
Still no joy. The problem only affects certain IP addresses; if I post the same data directly to the IP of one of the servers, I have no problem.
<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>502 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>
EDIT
We use Cloudflare. It is usually disabled for this host; however, I have just re-enabled it and now traffic is accepted again, presumably because the traffic then originates from a Cloudflare IP.
A 502 error is a "bad gateway" response. Have you checked the health check status of your instances at the time that the 502 errors are occurring?
You haven't mentioned whether you're running backends in more than one region. It's possible that your backends in one region are all being marked unhealthy at once, which is causing your failures.
Are your backend services using the default HTTP health check, or have you customized it? If the former, you might consider defining a more lenient health check for your backends (though this may mask actual application server failures). The default is to check backends every 5s on "/" with a 5s timeout, and require 2 consecutive failures or successes to change the state.
I've seen similar 502 errors when the backend service is too busy to handle the incoming traffic. Try adding more instances and it should go away.
However, I see this as a band-aid fix, as it does not really solve the underlying issue.
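One way to add instances automatically is CPU-based autoscaling on the managed instance group; a sketch (the group name, zone, and thresholds are illustrative):
gcloud compute instance-groups managed set-autoscaling my-mig \
    --zone us-central1-a --max-num-replicas 10 --target-cpu-utilization 0.6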
I've configured an HTTP(S) load balancer as per the documentation at https://cloud.google.com/compute/docs/load-balancing/http/
When I try to access the site via the public IP address associated with the load balancer, I get a 502 response with the message:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
I believe this is coming from the load balancer.
Does anyone have any insight into what might be going on, or what more I should be looking at?