I have an HTTPS load balancer configured with one backend service and 3 instance groups:
Endpoint protocol: HTTPS
Named port: https
Timeout: 600 seconds
Health check: ui-health2
Session affinity: Generated cookie
Affinity cookie TTL: 0 seconds
Cloud CDN: disabled
Instance group Zone Healthy Autoscaling Balancing mode Capacity
group-ui-normal us-central1-c 1 / 1 Off Max. CPU: 80% 100%
group-ui-large us-central1-c 2 / 2 Off Max. CPU: 90% 100%
group-ui-xlarge us-central1-c 2 / 2 Off Max. CPU: 80% 100%
Default host and path rules, SSL terminated.
The problem is that session affinity is not working properly and I have no idea why. Most of the time it seems to work, but every so often a request is answered by a different instance even though it carries the same GCLB cookie. I can reproduce this with an AJAX request every 5 seconds: 20+ requests go to instance A, then one request goes to instance B, then another 20+ go to A, and so on.
I looked at the LB logs and there is nothing strange (apart from the occasional odd response), and CPU usage is low. Where can I find out whether an instance was considered "unhealthy" for 5 seconds?
The Apache logs show no errors in the health check pings or the requests.
Maybe there is some strange interaction between the "Balancing mode" and session affinity?
The load balancer is designed to handle a considerable volume of requests, and at that scale it spreads the load quite effectively. The issue here is that your load balancer receives very little traffic, so a single additional request can change the measured load drastically, and that gets in the way of the load balancer working effectively (including honoring affinity).
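If you want to confirm which affinity settings and balancing mode are actually applied, something like this should show them (the backend service name here is just a guess, replace it with yours):

    # Inspect session affinity, cookie TTL and per-backend balancing mode
    gcloud compute backend-services describe ui-backend-service --global \
        --format="yaml(sessionAffinity,affinityCookieTtlSec,backends)"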
Every time a client sends a request (say, HTTP), the request is received by the load balancer (if one is set up) and it redirects the request to one of the instances. Now a connection is established between Client -> LB -> Server. This persists as long as the client keeps sending requests.
But if the client stops sending requests to the server for a period of time (more than the idle time), the load balancer stops the communication between the client and that particular server. If the client then sends a request again after some period of time, the load balancer should forward that request to some other instance.
What is idle time?
It is a period of time during which the client is not sending any requests to the load balancer. It generally ranges from 60 to 3600 seconds, depending on the cloud service provider.
Finally, my question.
Ideally, after the idle timeout the load balancer should terminate the existing connection, but this is not the case with GCP's internal load balancer (I have a PoC on this as well). GCP's load balancer doesn't terminate the connection even after the idle timeout and maintains it indefinitely. Is there any way to reconfigure the load balancer to avoid such indefinitely long connections?
Our load balancer is returning 502 errors for some requests. It is a very low percentage of the total: we have around 36,000 requests per hour and about 40 errors per hour, so roughly 0.1% of requests return an error.
The instances are healthy when the errors occur, and we have added this firewall rule for the load balancer: 130.211.0.0/22, tcp:1-5000, applied to all targets.
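For reference, the rule is roughly the equivalent of this gcloud command (the rule name here is just an example; no target tags, so it applies to all instances):

    # Allow the Google load balancer / health check range to reach our backends
    gcloud compute firewall-rules create allow-glb-range \
        --network=default \
        --source-ranges=130.211.0.0/22 \
        --allow=tcp:1-5000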
It is not a very serious problem because the application tolerates these errors, but I would like to know why they occur.
Any help will be appreciated.
It seems that there is no easy solution for this.
As Mike Fotinakis explains in this blog post (thank you for this info, JasonG :)):
It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.
In my case I'm using Apache with the mpm_prefork module. The proposed solution is to increase the connection keep-alive timeout to 650s, but this is not possible because each connection ties up one worker process for its whole lifetime (so this would represent a great waste of resources).
UPDATE:
It seems there is now some documentation about this problem on the official load balancer documentation page (search for "Timeouts and retries"): https://cloud.google.com/compute/docs/load-balancing/http/
They recommend setting the KeepAliveTimeout value to 620 seconds in both cases (Apache and NGINX).
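For reference, the corresponding settings look roughly like this (file locations vary by distribution):

    # Apache (e.g. apache2.conf / httpd.conf): keep idle connections open
    # longer than the load balancer's 600-second backend timeout
    KeepAlive On
    KeepAliveTimeout 620

    # NGINX (http block in nginx.conf): same idea
    keepalive_timeout 620s;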
I had an issue with 502s that was unexplainable after recreating a load balancer and backend config. I recreated my backend and instance group for unmanaged instances and that seemed to fix the issue for me. I wasn't able to identify any problem in my GCP configuration :(
But I had a lot more errors, about 1 in 10. There are load balancer logs that will tell you what the cause is, and the docs explain the possible causes.
E.g., mine were:
jsonPayload: {
  statusDetails: "failed_to_pick_backend"
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
}
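If you want to pull these entries yourself, a Cloud Logging filter along these lines should surface them (assuming request logging is enabled on the backend service):

    # Fetch recent 502s served by the HTTP(S) load balancer, with statusDetails
    gcloud logging read \
        'resource.type="http_load_balancer" AND httpRequest.status=502' \
        --limit=20 --format=json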
If you're using nginx, it happens on POSTs, and the error is reported as "backend_connection_closed_before_data_sent_to_client", it may be fixed by changing your nginx timeouts. See this excellent blog post:
https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340#.btzyusgi6
TLDR: What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
Details: I have a relatively standard setup: GCE instances in a managed instance group, with a global HTTPS load balancer in front of them pointed at a backend service containing only that one managed instance group. Health checks are standard: 5-second timeout, 5-second check interval, unhealthy threshold of 2 consecutive failures, healthy threshold of 2 consecutive successes.
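For concreteness, I believe the health check is roughly equivalent to the following (the name is a placeholder, and I'm interpreting the second 5-second value as the check interval):

    # Hypothetical recreation of the health check described above
    gcloud compute health-checks create https ui-health-check \
        --timeout=5s \
        --check-interval=5s \
        --unhealthy-threshold=2 \
        --healthy-threshold=2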
I deploy some new instances, add them to the instance group, and remove the old ones. After many minutes (usually 10-15), I delete the old instances.
Every once in a while, I notice that deleting the old instances (which I believe are no longer receiving traffic) correlates with a sporadic 502 response to a client, which can be seen only in the load-balancer level logs:
I've done a bunch of log correlation, tcpdumping, and load testing to be fairly confident that this 502 is not being served by one of the new, healthy instances. In any case, my question is:
What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
I think what you are looking for is the connection draining feature: https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining
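Something along these lines should enable it on the backend service (the service name and the 300-second timeout are just examples):

    # Give in-flight requests up to 300 seconds to finish before an
    # instance is fully removed from the serving path
    gcloud compute backend-services update my-backend-service \
        --global \
        --connection-draining-timeout=300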
To answer my own question: it turns out that these 502s were not related to shutting down an instance; 10 minutes was plenty of time to remove an instance from the serving path. The 502s were caused by a race condition between NGINX timeouts and the GCP HTTP(S) Load Balancer's timeouts. I've written up a full blog post on it here: Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer
My Elastic Beanstalk environment keeps switching its health from Warning to OK and back from OK to Warning roughly every 15 minutes.
The scaling settings are Min: 2, Max: 4, and there are 4 EC2 instances running.
Can somebody please help?
Screenshot
I can't tell from your screenshot whether new instances were created, so it could be either the scaling trigger being met or the load balancer health check.
Your load balancer trigger is configured here:
Configuration => Load Balancer => EC2 Instance Health Check
the default value is to perform an HTTP request to the root path (/), but you can configure it as you see fit. In most cases, degradation happens because your app's response time isn't fast enough. You can change the parameters, or fix the app.
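If you'd rather point the health check at a dedicated endpoint, a sketch of the .ebextensions change would be (the /health path is just an example):

    # .ebextensions/healthcheck.config
    option_settings:
      aws:elasticbeanstalk:application:
        # Path the ELB health check will request instead of /
        Application Healthcheck URL: /health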
As for the scaling trigger, it's configured here:
Configuration => Scaling => Scaling Trigger
By default this is set to NetworkOut, which is the number of bytes sent out from the node and has nothing to do with your server being overloaded. I'm not sure why Beanstalk decided to use this metric by default, but you can change it to CPU utilization or any other metric documented here.
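As a sketch, the same change can be made via .ebextensions instead of the console (the thresholds here are arbitrary examples):

    # .ebextensions/scaling.config
    option_settings:
      aws:autoscaling:trigger:
        # Scale on CPU instead of the default NetworkOut metric
        MeasureName: CPUUtilization
        Statistic: Average
        Unit: Percent
        LowerThreshold: 20
        UpperThreshold: 70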
Additional, unexpected HTTPS connections are being made to GCE servers.
This started on 2nd October and is affecting europe-west1-b and us-central1-b.
We have the same codebase running on servers in Amazon EC2 that are not affected.
Is anyone else seeing issues with HTTPS traffic to GCE?
UPDATE: Clarification of duplicated connections:
A single HTTPS request from a web browser for example
GET /favicon.ico HTTP/1.1
results in 5 HTTPS connections being opened; no HTTP request is sent over them, and then they are closed (before the timeout period).
Then a final connection is opened and the request is sent as it should be.
Note:
This would usually go undetected. However, we only allow 10 SSL connections from a single IP within 1 second (a velocity restriction).
I have temporarily increased this to 20 and everything is working OK.
What I don't understand is why this would suddenly start happening and only on GCE servers.
I will update this again when I have looked into the raw SSL traffic.