AWS Elastic Load Balancing: Seeing extremely long initial connection time - google-chrome

For a couple of days, we have often seen an extremely long initial connection time (15 s to 1.3 minutes) to our ELBs when making any request via SSL.
Oddly, I was only able to observe this in Google Chrome (not Safari nor Firefox nor curl).
It does not occur on every single request, but on around 50% of them. It occurs with the first request (the OPTIONS call).
Our setup is the following:
Cross-zone ELB that connects to a node.js backend (currently in 2 AZs in eu-west-1). All instances are healthy, and once a request comes through, it is processed normally. Currently, there is basically no load on the system. CloudWatch for the ELB does not report any backend connection errors, and shows neither a surge queue (value 0) nor any spillover count. The ELB metrics show low latency (< 100 ms).
We have Route53 configured to route to the ELB (we don't see any dns trouble, see attached screenshot).
We have different REST APIs that all have this setup. It occurs on all of the ELBs (each of them connecting to an independent node.js backend). All of these ELBs are set up the same way via our CloudFormation template.
The ELBs also do our SSL-termination.
What could lead to such a behavior? Is it possible that the ELBs are not configured properly? And why could it only appear on Google Chrome?

I think it is a possible ELB misconfiguration. I had the same problem when I attached private subnets to the ELB. I fixed it by changing the private subnets to public ones. See https://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-manage-subnets.html

Just to follow up on Nikita Ogurtsov's excellent answer: I had the same problem, except that just one of my subnets happened to be private and the rest public.
Even if you think your subnets are public, I recommend you double-check the route tables to ensure that they all have an Internet Gateway route.
You can use a single route table with a gateway route for all your LB subnets, if that makes sense for your setup:
VPC/Subnets/(select subnet)/Route Table/Edit
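If you would rather script this check than click through the console, here is a minimal sketch using boto3. The load balancer name is a placeholder, and note that a subnet without an explicit route table association falls back to the VPC's main route table, which this sketch ignores for brevity:

```python
import boto3

elb = boto3.client("elb")
ec2 = boto3.client("ec2")

# Placeholder name: substitute your own Classic Load Balancer.
lb = elb.describe_load_balancers(
    LoadBalancerNames=["my-elb"]
)["LoadBalancerDescriptions"][0]

for subnet_id in lb["Subnets"]:
    # Route tables explicitly associated with this subnet.
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["RouteTables"]
    # A subnet is effectively public if some route points at an
    # Internet Gateway (igw-*).
    has_igw = any(
        route.get("GatewayId", "").startswith("igw-")
        for table in tables
        for route in table["Routes"]
    )
    print(subnet_id, "public (IGW route)" if has_igw else "PRIVATE: no IGW route!")
```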

For me the issue was that I had an unused "Availability Zone" in my Classic Load Balancer. Once I removed the unhealthy and unused Availability Zone the consistent 20 or 21 second delay in "Initial Connection" dropped to under 50ms.
Note: You may need to give it time to update. I had my DNS TTL set to 60 seconds so I would see the fix within a minute of removing the unused Availability Zone.

This can be a problem with Amazon's ELB itself. The ELB scales its number of instances with the number of requests.
You should see some peaks of requests at those times.
Amazon adds instances in order to fit the load.
The instances are not yet reachable during the launch process, so your clients get those timeouts. It's essentially random, so you should:
ping the ELB to collect all the IPs in use
run mtr against every IP found
keep an eye on CloudWatch
look for clues

Solution: if your DNS is configured to point directly at the ELB, you should reduce the TTL of the (IP, DNS) association. The ELB's IPs can change at any time, so stale records can do serious damage to your traffic.
Clients keep some of the ELB's IPs in cache, so you can run into this kind of trouble.
Scaling Elastic Load Balancers
Once you create an elastic load balancer, you must configure it to accept incoming traffic and route requests to your EC2 instances. These configuration parameters are stored by the controller, and the controller ensures that all of the load balancers are operating with the correct configuration. The controller will also monitor the load balancers and manage the capacity that is used to handle the client requests. It increases capacity by utilizing either larger resources (resources with higher performance characteristics) or more individual resources. The Elastic Load Balancing service will update the Domain Name System (DNS) record of the load balancer when it scales so that the new resources have their respective IP addresses registered in DNS. The DNS record that is created includes a Time-to-Live (TTL) setting of 60 seconds, with the expectation that clients will re-lookup the DNS at least every 60 seconds. By default, Elastic Load Balancing will return multiple IP addresses when clients perform a DNS resolution, with the records being randomly ordered on each DNS resolution request. As the traffic profile changes, the controller service will scale the load balancers to handle more requests, scaling equally in all Availability Zones.
Best Practices ELB on AWS
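To see this caching and scaling behavior for yourself, here is a small sketch (the hostname is a placeholder) that re-resolves the ELB's DNS name every 60 seconds, matching the TTL quoted above, so you can watch the set of IPs change as the ELB scales:

```python
import socket
import time

# Placeholder: substitute your ELB's DNS name.
HOSTNAME = "my-elb-1234567890.eu-west-1.elb.amazonaws.com"

while True:
    # Resolve all A records for the ELB; order is randomized by the ELB's DNS.
    infos = socket.getaddrinfo(HOSTNAME, 443, proto=socket.IPPROTO_TCP)
    ips = sorted({info[4][0] for info in infos})
    print(time.strftime("%H:%M:%S"), ips)
    time.sleep(60)  # re-resolve once per TTL window
```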

An ALB needs at least 2 Availability Zones. If you use a private/public/NAT VPC setup, then all public subnets must have a connection to the Internet.

For me the issue was that the ALB was pointing to an Nginx instance, which had a misconfigured DNS resolver. This meant that Nginx tried to use the resolver, timed out, and then actually started working a bit later.
This is not really connected to the load balancer itself, but maybe it helps someone figure out the issue in their own setup.

Check the security groups too. That was the issue in my case.

I see a similar problem in my Chrome logs (a 1.3-minute lag). It happens on an OPTIONS request, and in Wireshark I don't even see the request leaving the PC in the first place. Any suggestions as to what Chrome might be doing are welcome.

We recently encountered Chrome taking 1.3 minutes to load pages, but the cause was slightly different. Just popping it here in case it helps someone.
1.3 minutes seems to be how long Chrome will wait when trying to connect to a specific IP. Our domain name has multiple IP addresses in its A record (a round-robin-style setup), and one of those IPs belonged to a server that had crashed. So sometimes the browser would connect quickly because it used a valid IP, and sometimes we would get the long wait as it tried to connect to the invalid IP, timed out, and then retried with a valid IP.
So it is worth checking that all the IPs listed when you dig your domain are responding correctly.
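If you want to automate that check, a small sketch along these lines (the domain and port are placeholders) resolves every A record and attempts a TCP connection to each with a short timeout, flagging any address that never answers:

```python
import socket

DOMAIN = "example.com"  # placeholder: your domain
PORT = 443              # placeholder: the port your service listens on

infos = socket.getaddrinfo(DOMAIN, PORT, proto=socket.IPPROTO_TCP)
for ip in sorted({info[4][0] for info in infos}):
    try:
        # A dead server typically hangs here until the timeout fires.
        with socket.create_connection((ip, PORT), timeout=3):
            print(ip, "OK")
    except OSError as exc:
        print(ip, "FAILED:", exc)
```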

Related

Openshift HAproxy sticky session issue

I have a deployment with 2 pods of a web application. The web application requires a logon and maintains a session.
After I kill the first pod, I am automatically redirected to the logon page of the second pod, but when the first pod comes back up I am redirected back to it.
I have tried using the HAProxy "balance source" algorithm and cookies.
Any idea why it doesn't stay with the second pod?
balance source uses a hashing algorithm that changes the workload distribution every time the number of available backends changes, because that is what it's designed to do. If you had more than 2 backends, you would also find that taking down any one backend will cause some traffic that wasn't even hitting the impacted backend to shift to another, because of this redistribution.
If the hash result changes due to the number of running servers changing, many clients will be directed to a different server.
http://cbonte.github.io/haproxy-dconv/1.6/configuration.html#4-balance
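As a toy illustration (not HAProxy's actual hashing implementation), the following sketch shows why a modulo-based source hash remaps many clients when the server count changes, even clients whose server never failed:

```python
import hashlib

def pick_server(client_ip, servers):
    # Stable hash of the source IP, mapped onto the current server list.
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

clients = ["10.0.0.%d" % i for i in range(1, 11)]
before = {c: pick_server(c, ["s1", "s2", "s3"]) for c in clients}
after = {c: pick_server(c, ["s1", "s3"]) for c in clients}  # s2 went down

# Clients that were remapped even though their server never failed.
moved = [c for c in clients if before[c] != after[c] and before[c] != "s2"]
print("remapped despite a healthy server:", moved)
```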
For an explanation of why you didn't see the expected behavior when using cookies instead of balance source, we'd need to see your configuration.

Why do my GCE instances auto-restart every 6 hours?

I've got the following setup:
Instance template for n1-standard-1 instances, HTTP(S) accessible, on SSD disks
Instance group with named ports for 80/443, autoscaling turned on with min/max=2/10 instances, target CPU=60%, cool-down=60s, and initial delay=600s
Group health check on port 80 every 10s with a threshold of 3 attempts
GCE HTTP(S) load balancer with above group as HTTP backend, max CPU=80%, health check identical to the one defined above for the group
Everything else is default. What I'm seeing from my graphs is that my 2 instances are regularly re-starting for no apparent reason. The instances both re-start every 6 hours, but staggered an hour apart so they're at least never down at the same time. The instance template is made from the disk of an instance that ran reliably (i.e. without regular, inexplicable re-starts) for months outside of an auto-scaling group. I've never seen one of my instances listed as unhealthy in the LB dashboard, but if I had to guess, I'd guess that my health checks are misconfigured somehow. Thanks.
Running "gcloud compute operations list" yields events of type "compute.instances.repair.recreateInstance" the correspond exactly to the periodic restarts. I have no idea why this is happening and haven't found any clues searching.
Your instances are restarted because they are probably unhealthy. Please check whether BackendService.GetHealth(group) returns HEALTHY for all of the instances. If not, this might be an issue with your server, or a misconfiguration of your firewall rules for the range 130.211.0.0/22 (https://cloud.google.com/compute/docs/load-balancing/health-checks).
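As a starting point, here is a hedged sketch using the google-api-python-client (assuming Application Default Credentials are configured; the project, backend service name, and instance group URL are all placeholders):

```python
from googleapiclient import discovery

compute = discovery.build("compute", "v1")
result = compute.backendServices().getHealth(
    project="my-project",                 # placeholder
    backendService="my-backend-service",  # placeholder
    body={
        # Full URL of the instance group behind the backend service.
        "group": ("https://www.googleapis.com/compute/v1/projects/"
                  "my-project/zones/us-central1-a/instanceGroups/my-group")
    },
).execute()

for status in result.get("healthStatus", []):
    # healthState should read HEALTHY for every instance.
    print(status["instance"], status["healthState"])
```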

Google VM Instance becomes unhealthy on its own

I have been using Google Cloud for quite some time and everything worked fine. I was using a single VM instance to host both the website and the MySQL database.
Recently, I decided to move the website to autoscaling so that on days when the traffic increases, the website doesn't go down.
So, I moved the database to Cloud SQL and created a VM group which hosts the PHP, HTML, and image files. Then, I set up a load balancer to divert traffic to the various VM instances in the VM group.
The problem is that the backend service (the VM group inside the load balancer) becomes unhealthy on its own after working fine for 5-6 hours, and then becomes healthy again after 10-15 minutes. I have also seen that the problem can occur when I run a script that is a bit lengthy, with many MySQL queries.
I checked the health check and it was returning a 200 response. During the down period of 10-15 minutes, the VM instance is still accessible via its own IP address.
Everything else is the same; I have just added a load balancer in front of the VM instance, and the problem has started.
Can anybody help me troubleshoot this problem?
It sounds like your server is timing out (blocking?) on the health check during the times the load balancer reports it as down. A few things you can check:
The logs (I'm presuming you're using Apache?) should include a duration along with the request status. The default health check timeout is 5s, so if your health check returns a 200 in 6s, the health checker will time out after 5s and treat the host as down.
You mention that a heavy mysql load can cause the problem. Have you looked at disk I/O statistics and CPU to make sure that this isn't a load-related problem? If this is CPU or load related, you might look at increasing either CPU or disk size, or moving your disk from spindle-backed to SSD-backed storage.
Have you checked that you have sufficient threads available? Ideally, your health check would run fairly quickly, but it might be delayed (for example) if you have 3 threads and all three are busy running some other PHP script that's waiting on the database.
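To rule out the timeout theory, a minimal standard-library sketch like this one (the URL is a placeholder; hit the instance directly, bypassing the load balancer) times the health endpoint at the same 10-second interval as the question's health check:

```python
import time
import urllib.request

# Placeholder: the instance's health endpoint, not the load balancer's.
URL = "http://10.0.0.5/health"

for _ in range(20):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            status = resp.status
    except OSError as exc:
        status = "error: %s" % exc
    # Anything over 5s here would be marked down by the default health check.
    print("%6.2fs  %s" % (time.monotonic() - start, status))
    time.sleep(10)  # match the 10s check interval from the question
```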

Test performance in Openshift and prevent get banned IP

I have an application hosted on OpenShift. Now I want to figure out how many requests it can handle, in order to check its speed and availability.
So my first attempt will be to generate multiple HTTP GET requests to my REST service (made in Python and hosted on OpenShift).
My fear is that my workplace's IP could get banned, because this could look like an attack.
On the other hand, I see there are tools like New Relic or DataDog to check metrics, but I don't know if I can simulate HTTP requests with them and then check the response times.
Openshift Response
I finally wrote to OpenShift support and they told me I can simulate HTTP requests without worries.
I recall the default behavior being that each gear can handle 16 concurrent connections; then auto-scaling kicks in and you get a new gear. Therefore I think it makes sense to start by testing that a gear works well with 16 users at once. If not, then you can change the scaling policy to what works best for your application.
BlazeMeter is a tool that could probably help with creating the connections. They mention 100,000 concurrent users on that main page so I don't think you have to worry about getting banned for this sort of test.
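If you prefer to start small before reaching for a hosted tool, here is a minimal standard-library sketch that runs 16 concurrent workers (matching the per-gear connection count mentioned above) and reports response times; the URL is a placeholder:

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://my-app.example.com/api/endpoint"  # placeholder
WORKERS = 16            # one per concurrent connection a gear handles
REQUESTS_PER_WORKER = 10

def worker(_):
    timings = []
    for _ in range(REQUESTS_PER_WORKER):
        start = time.monotonic()
        with urllib.request.urlopen(URL, timeout=30):
            pass
        timings.append(time.monotonic() - start)
    return timings

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = [t for ts in pool.map(worker, range(WORKERS)) for t in ts]

print("requests: %d  avg: %.3fs  max: %.3fs"
      % (len(results), sum(results) / len(results), max(results)))
```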

Load balancing many services on few GCE nodes

I have 2 GCE nodes, each running the same N services. For each service, I use the GCE network load balancer to distribute requests to the 2 nodes. I therefore created the following setup:
Since I want the load balancer to check the health of each service separately, I have a health check for each of the N services (every health check checks a different port for an HTTP response)
Since each service has its own health check, I have N target pools, all of them containing just nodes 1 and 2, but each with a different health check.
Since I have N target pools, I also have N forwarding rules
Since I want each of these load balanced services to be available externally (actually, from within GAE), I assign each of the forwarding rules a static IP address
The problem is that I have more than 7 services I want to run, and GCE's regional quota only allows 7 static IP addresses. This makes me suspect I'm doing something wrong and that there's a better design for what I'm doing.
The root of my problem seems to be that I want a health check for each service (instead of each node), which I can only seem to do if I split up the entire path up to the forwarding rule in the GCE network load balancer.
Your configuration looks reasonable, given that each service has its own dedicated health-check.
Note that if you need more than the default resource quotas and your project is not in Free Trial stage, you can request more quota using the quota change request form.