I'm seeing intermittent timeouts on requests from Azure API Management (APIM) to Azure Front Door (AFD) with 'ClientConnectionFailure'
My architecture:
Customer --> APIM --> AFD --> Backend Function Apps
The requests from my APIM instance to AFD time out with 'ClientConnectionFailure'. The timeout occurs because the underlying HTTP client closes the connection after the default of 100 seconds.
Since AFD logs are limited, is there a way to debug where exactly these intermittent failures happen? Do they occur in AFD itself, or in the backend infrastructure and merely surface at the AFD level in Application Insights?
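One way to narrow this down without richer AFD logs is to probe each hop independently and compare timings. Below is a minimal Go sketch (the URLs are placeholders, not taken from this setup) that sends the same request through AFD and directly to the backend Function App, with a client timeout deliberately longer than the 100-second default, so you can see which hop stalls:

package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe times a single GET against url with a generous timeout so we can
// see whether the stall happens before or after the 100-second mark.
func probe(name, url string) {
	client := &http.Client{Timeout: 5 * time.Minute} // longer than the 100s default
	start := time.Now()
	resp, err := client.Get(url)
	elapsed := time.Since(start)
	if err != nil {
		fmt.Printf("%s: error after %v: %v\n", name, elapsed, err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("%s: HTTP %d after %v\n", name, resp.StatusCode, elapsed)
}

func main() {
	// Hypothetical endpoints; substitute your own AFD and Function App URLs.
	probe("via-AFD", "https://myfrontdoor.azurefd.net/api/slow")
	probe("direct-backend", "https://myfuncapp.azurewebsites.net/api/slow")
}

If the direct-backend probe also stalls, the failure originates behind AFD and only surfaces there.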
Related
When starting a session using the aws-sdk-go-v2 library, I have tried various settings to change the websocket timeout (30 seconds is the current timeout I see). The AWS documentation suggests the following:
ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
defer cancel() // the docs' example discards the cancel func, which leaks the context
ses, err := ssm.NewFromConfig(cfg).StartSession(ctx, in)
But this timeout applies to request processing, not to the steady-state communications link. The SDK effectively sets up a websocket that is meant to be maintained. I was able to work around this when setting up an RDP session, since I could generate layer-7 application traffic to keep the socket alive for an hour or more. But in another case I want to use the same SDK to connect to a database, and there is no layer 7 that I have access to.
The questions I have are:
Is there a configuration parameter for keeping the SDK websocket alive for more than 30 seconds?
Is this SDK websocket only for transient data exchanges and is therefore designed to terminate within 30 seconds?
Are there any workarounds for a database connection scenario where I might inject layer 7 traffic to keep the websocket alive?
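On question 3: if you drive the websocket yourself from the StartSession output (StreamUrl plus TokenValue), one possible workaround is sending periodic ping control frames so the connection never looks idle. A sketch using github.com/gorilla/websocket; the 25-second interval is an assumption, and whether the SSM endpoint keeps the session open on bare ping frames (rather than protocol-level messages) is something to verify:

package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

// keepAlive sends a websocket ping every interval so the connection is
// never idle long enough to be reaped, even with no layer-7 traffic.
func keepAlive(conn *websocket.Conn, interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			deadline := time.Now().Add(10 * time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				log.Printf("ping failed: %v", err)
				return
			}
		case <-done:
			return
		}
	}
}

func main() {
	// streamURL would come from the StartSession response (ses.StreamUrl);
	// this value is a placeholder.
	streamURL := "wss://ssmmessages.us-east-1.amazonaws.com/v1/data-channel/example"
	conn, _, err := websocket.DefaultDialer.Dial(streamURL, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	done := make(chan struct{})
	go keepAlive(conn, 25*time.Second, done) // shorter than the observed 30s timeout
	// ... exchange session data here ...
	close(done)
}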
The eth_subscribe method of the JSON-RPC API requires a full-duplex connection, such as a WebSocket, so subscribing to events is impossible over an HTTP connection.
I use Ganache as an Ethereum node and Metamask as a browser provider to test my dapp. The Ganache server listens for requests over HTTP. Nevertheless, the dapp has no trouble receiving notifications from the provider after subscribing with eth_subscribe.
Have I got it right that Metamask, when faced with an HTTP-only server, gets its events some other way (such as polling) deep under the hood? And that a dapp developer therefore need not worry about whether the user's node supports duplex connections, as long as there is a Metamask provider? Or is it all Ganache: does it just support WebSockets?
I failed to find any clarification on this issue in the docs.
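For what it's worth, the same distinction is easy to see outside the browser. A Go sketch using github.com/ethereum/go-ethereum, assuming Ganache on its default port: SubscribeFilterLogs works over a ws:// endpoint, while the identical call over http:// fails with a "notifications not supported" error, which is exactly why an HTTP-only provider has to fall back to polling:

package main

import (
	"context"
	"log"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Over ws:// eth_subscribe works; dialing "http://127.0.0.1:8545"
	// instead makes the Subscribe call below return an error.
	client, err := ethclient.Dial("ws://127.0.0.1:8545") // Ganache's default port, assumed
	if err != nil {
		log.Fatal(err)
	}

	logs := make(chan types.Log)
	sub, err := client.SubscribeFilterLogs(context.Background(), ethereum.FilterQuery{}, logs)
	if err != nil {
		log.Fatal(err) // this is the failure you'd see on an HTTP-only endpoint
	}
	defer sub.Unsubscribe()

	for vLog := range logs {
		log.Printf("event in block %d: tx %x", vLog.BlockNumber, vLog.TxHash)
	}
}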
Every time a client sends a request (say, HTTP), the request is received by a load balancer (if one is set up), which redirects it to one of the instances. A connection is now established: Client -> LB -> Server. It persists as long as the client keeps sending requests.
But if the client stops sending requests for a period of time (more than the idle time), the load balancer will stop the communication between the client and that particular server. If the client then sends a request again after some time, the load balancer should forward it to some other instance.
What is idle time?
It is the period during which the client sends no requests to the load balancer. It generally ranges from 60 to 3600 seconds, depending on the cloud provider.
Finally, my question.
Ideally, after the idle timeout the load balancer should terminate the existing connection, but this is not the case with GCP's internal load balancer (I have a PoC for this as well). GCP's load balancer doesn't terminate the connection even after the idle timeout and maintains it indefinitely. Is there any way to reconfigure the load balancer to avoid such indefinitely held connections?
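If the load balancer won't close idle connections for you, one workaround is to enforce the idle timeout at the backend itself. A minimal Go sketch of a TCP server (the 60-second value is an arbitrary choice) that closes any connection with no traffic for the idle period:

package main

import (
	"log"
	"net"
	"time"
)

const idleTimeout = 60 * time.Second

func handle(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, 4096)
	for {
		// Refresh the deadline on every read; if the client goes quiet for
		// idleTimeout, the read fails and we close the connection ourselves
		// instead of waiting for the load balancer to do it.
		conn.SetReadDeadline(time.Now().Add(idleTimeout))
		n, err := conn.Read(buf)
		if err != nil {
			return // idle timeout or client hang-up
		}
		conn.Write(buf[:n]) // echo back, stand-in for real request handling
	}
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn)
	}
}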
I have an HTTPS load balancer configured with one backend service and 3 instance groups:
Endpoint protocol: HTTPS
Named port: https
Timeout: 600 seconds
Health check: ui-health2
Session affinity: Generated cookie
Affinity cookie TTL: 0 seconds
Cloud CDN: disabled
Instance group Zone Healthy Autoscaling Balancing mode Capacity
group-ui-normal us-central1-c 1 / 1 Off Max. CPU: 80% 100%
group-ui-large us-central1-c 2 / 2 Off Max. CPU: 90% 100%
group-ui-xlarge us-central1-c 2 / 2 Off Max. CPU: 80% 100%
Default host and path rules, SSL terminated.
The problem is that session affinity is not working properly, and I have no idea why. Most of the time it seems to work, but every so often a request is answered by a different instance despite carrying the same GCLB cookie. I reproduced all this with an AJAX request every 5 seconds: 20+ requests go to instance A, then one request to instance B, then another 20+ to A...
I looked at the LB logs and there is nothing strange (apart from the occasional odd response), and CPU usage is low. Where can I find out whether some instance was "unhealthy" for 5 seconds?
The Apache logs show no errors in the health pings or the requests.
Maybe there is some strange interaction between the "Balancing mode" and the session affinity?
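To rule out browser-side cookie handling, the polling loop can be reproduced from a plain HTTP client that persists the GCLB cookie across requests. A Go sketch; the /whoami endpoint echoing the instance name is hypothetical:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"time"
)

func main() {
	jar, _ := cookiejar.New(nil) // keeps the GCLB affinity cookie between requests
	client := &http.Client{Jar: jar}

	for i := 0; i < 100; i++ {
		// Hypothetical endpoint that returns the serving instance's name.
		resp, err := client.Get("https://example.com/whoami")
		if err != nil {
			fmt.Println("request error:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("request %d -> %s\n", i, body)
		time.Sleep(5 * time.Second) // same cadence as the AJAX poll
	}
}

If the instance name still flips with the cookie held constant, the switch is happening on the load-balancer side rather than in the client.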
Load balancers are designed to handle a considerable volume of requests, and they balance that load quite effectively.
The issue here is that your load balancer receives very few requests, so a single request can change the measured load drastically, which gets in the way of the load balancer working efficiently.
TLDR: What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
Details: I have a relatively standard setup: GCE instances in a managed instance group, global HTTPS load balancer in front of them pointed at a backend service with only the one managed instance group in it. Health checks are standard 5 seconds timeout, 5 seconds unhealthy threshold, 2 consecutive failures, 2 consecutive successes.
I deploy some new instances, add them to the instance group, and remove the old ones. After many minutes (usually 10-15), I delete the old instances.
Every once in a while, I notice that deleting the old instances (which I believe are no longer receiving traffic) correlates with a sporadic 502 response to a client, which can be seen only in the load-balancer-level logs.
I've done a bunch of log correlation, tcpdumping, and load testing to be fairly confident that this 502 is not being served by one of the new, healthy instances. In any case, my question is:
What is the upper-bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can be safely deleted?
I think what you are looking for is the connection draining feature. https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining
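Connection draining is configured on the backend service; for reference, a sketch of setting it programmatically with the google.golang.org/api/compute/v1 client (the project and backend-service names are placeholders):

package main

import (
	"context"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx) // uses Application Default Credentials
	if err != nil {
		log.Fatal(err)
	}

	// Give in-flight requests up to 300 seconds to finish before an
	// instance is removed from the serving path.
	patch := &compute.BackendService{
		ConnectionDraining: &compute.ConnectionDraining{DrainingTimeoutSec: 300},
	}
	_, err = svc.BackendServices.Patch("my-project", "my-backend-service", patch).Context(ctx).Do()
	if err != nil {
		log.Fatal(err)
	}
}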
To answer my own question: it turns out that these 502s were not related to shutting down an instance; 10 minutes was plenty of time to remove an instance from the serving path. The 502s were caused by a race condition between nginx timeouts and GCP's HTTP(S) Load Balancer timeouts. I've written up a full blog post on it here: Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer
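The general shape of that fix is to make the backend's keep-alive idle timeout longer than the load balancer's, so the LB, never the backend, is the side that closes a reused connection. GCP's documented guidance is a backend keepalive timeout above the LB's 600 seconds; an equivalent for a Go backend, as a sketch, would be:

package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":8080",
		// GCP's HTTP(S) LB can hold a backend connection idle for up to
		// 600s before reusing it; keeping our idle timeout above that
		// ensures the LB, not this backend, closes reused connections,
		// avoiding the race that produces sporadic 502s.
		IdleTimeout: 620 * time.Second,
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	srv.ListenAndServe()
}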