Why do my GCE instances auto-restart every 6 hours? - google-compute-engine

I've got the following setup:
Instance template for n1-standard-1 instances, HTTP(S) accessible, on SSD disks
Instance group with named ports for 80/443, autoscaling turned on with min/max=2/10 instances, target CPU=60%, cool-down=60s, and initial delay=600s
Group health check on port 80 every 10s with a threshold of 3 attempts
GCE HTTP(S) load balancer with above group as HTTP backend, max CPU=80%, health check identical to the one defined above for the group
Everything else is default. What I'm seeing from my graphs is that my 2 instances are regularly re-starting for no apparent reason. The instances both re-start every 6 hours, but staggered an hour apart so they're at least never down at the same time. The instance template is made from the disk of an instance that ran reliably (i.e. without regular, inexplicable re-starts) for months outside of an auto-scaling group. I've never seen one of my instances listed as unhealthy in the LB dashboard, but if I had to guess, I'd guess that my health checks are misconfigured somehow. Thanks.
Running "gcloud compute operations list" yields events of type "compute.instances.repair.recreateInstance" the correspond exactly to the periodic restarts. I have no idea why this is happening and haven't found any clues searching.

Your instances are being restarted because they are probably being marked unhealthy. Please check whether BackendService.GetHealth(group) returns HEALTHY for all of the instances. If it doesn't, this could be a problem with your server itself, or a misconfiguration in your firewall rules for the health-check source range 130.211.0.0/22 (https://cloud.google.com/compute/docs/load-balancing/health-checks)
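A minimal sketch of how you might check both of those from the CLI, assuming your backend service is called web-backend-service and your instances serve plain HTTP/HTTPS on ports 80/443 (both names are placeholders for your own setup):

# Ask the load balancer what it currently thinks of each instance in the group
gcloud compute backend-services get-health web-backend-service --global

# Make sure the health checkers (130.211.0.0/22) are allowed to reach your serving ports
gcloud compute firewall-rules create allow-lb-health-checks \
    --source-ranges 130.211.0.0/22 \
    --allow tcp:80,tcp:443

If get-health shows instances flipping to UNHEALTHY shortly before each compute.instances.repair.recreateInstance event, that would point at the instance group's autohealer (rather than the autoscaler) recreating the instances.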

Related

Google Cloud SQL MySQL 2nd Gen Concurrent Connections?

The pricing page only gives this information for the 1st gen. Does anybody know the concurrent connection limits for the 2nd gen?
Second Generation instances are configured to allow up to 4000 connections though it does not mean that you can safely run your workload at 4000 connections for a given instance size. Different workloads will have different demands so you still need to monitor/benchmark your application to choose the appropriate instance size.
e.g. you might be able to make 4000 concurrent connections to an n1-standard-1 instance, but it's unlikely to perform well for many workloads.
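If you want to confirm the limit your instance is actually running with, and how close you are to it, one way is to ask MySQL directly from any client that can reach the instance (the host and user below are placeholders):

# Configured connection ceiling, and the number of connections currently open
mysql -h 10.0.0.5 -u root -p \
  -e "SHOW VARIABLES LIKE 'max_connections'; SHOW STATUS LIKE 'Threads_connected';"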

AWS MySQL read replica instances keep getting Too Many Connections error

I've purchased a single VPC on AWS and launched 6 MySQL databases in it, and for each one I've created a read replica, so that I can always run queries on the read replicas quickly.
Most of the day, my writing instances (the original instances) are fully loaded and their CPU usage is mostly at 99%. However, the read replicas show around 7-10% CPU usage, yet sometimes I get a "TOO MANY CONNECTIONS" error when I run a service connecting to a read replica.
I'm not that expert with AWS, but is this happening because the writing instances are fully loaded and they're on the same VPC?
Is this happening because the writing instances are fully loaded and they're on the same VPC?
No, it isn't. This is unrelated to replication. In replication, the replica counts as exactly 1 connection on the master, but replication does not consume any connections on the replica itself. The intensity of the overall replication workload has no impact on the connection count.
This issue simply means you have more clients connecting to the replica than are allowed by the parameter group for your RDS instance type. Use the query SELECT @@MAX_CONNECTIONS; to see what this limit is. Use SHOW STATUS LIKE 'Threads_connected'; to see how many connections exist currently, and use SHOW PROCESSLIST; (as the administrative user, or any user holding the PROCESS privilege) to see what all of these connections are doing.
If many of them show Sleep and have long values in Time (seconds spent in the current state), then the problem is that your application is somehow abandoning connections rather than properly closing them after use or when they are otherwise no longer needed.
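Run together against the replica endpoint, those three checks look roughly like this (the endpoint hostname and user are placeholders for your own replica):

# Parameter-group limit, current connection count, and what each connection is doing
mysql -h myreplica.xxxxxx.us-east-1.rds.amazonaws.com -u admin -p -e "
  SELECT @@max_connections;
  SHOW STATUS LIKE 'Threads_connected';
  SHOW FULL PROCESSLIST;"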

AWS Elastic Load Balancing: Seeing extremely long initial connection time

For a couple of days, we have often seen an extremely long initial connection time (15s to 1.3 minutes) to our ELBs when making any request via SSL.
Oddly, I was only able to observe this in Google Chrome (not in Safari, Firefox, or curl).
It does not occur on every single request, but on around 50% of requests, and it occurs on the first request (an OPTIONS call).
Our setup is the following:
A cross-zone ELB that connects to a node.js backend (currently in 2 AZs in eu-west-1). All instances are healthy, and once a request comes through it is processed normally. Currently there is basically no load on the system. CloudWatch for the ELB does not report any backend connection errors, nor a SurgeQueue (value 0) or spillover count. The ELB metrics show low latency (< 100 ms).
We have Route53 configured to route to the ELB (we don't see any DNS trouble, see attached screenshot).
We have different REST APIs that all have this setup. It occurs on all of the ELBs (each of them connecting to an independent node.js backend). All of these ELBs are set up the same way via our CloudFormation template.
The ELBs also do our SSL-termination.
What could lead to such a behavior? Is it possible that the ELBs are not configured properly? And why could it only appear on Google Chrome?
I think this is possibly an ELB misconfiguration. I had the same problem when I attached private subnets to the ELB, and fixed it by changing the private subnets to public ones. See https://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-manage-subnets.html
Just to follow up on @Nikita Ogurtsov's excellent answer: I had the same problem, except that just one of my subnets happened to be private and the rest public.
Even if you think your subnets are public, I recommend you double-check the route tables to ensure that they all have an Internet Gateway route.
You can use a single route table that has an Internet Gateway for all your LB subnets, if that makes sense for your setup.
VPC/Subnets/(select subnet)/Route Table/Edit
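If you prefer to verify this from the CLI rather than in the console, something like the following lists the routes associated with a subnet so you can confirm there is a 0.0.0.0/0 route pointing at an Internet Gateway (igw-...); the subnet ID below is a placeholder:

aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-0123456789abcdef0" \
  --query "RouteTables[].Routes[]"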
For me the issue was that I had an unused "Availability Zone" in my Classic Load Balancer. Once I removed the unhealthy and unused Availability Zone the consistent 20 or 21 second delay in "Initial Connection" dropped to under 50ms.
Note: You may need to give it time to update. I had my DNS TTL set to 60 seconds so I would see the fix within a minute of removing the unused Availability Zone.
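If you'd rather script that change than click through the console, the Classic ELB CLI has a call for removing a zone; the load balancer name and zone below are placeholders:

# Detach an Availability Zone with no registered instances from a Classic ELB
aws elb disable-availability-zones-for-load-balancer \
  --load-balancer-name my-classic-lb \
  --availability-zones eu-west-1c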
This can be a problem with the Amazon ELB itself. The ELB scales the number of load-balancer nodes with the number of requests.
You should see a peak of requests at those times.
Amazon adds nodes in order to fit the load.
Clients can be routed to nodes that are still in the launch process, so some of them hit those timeouts. It's fairly random, so you should (a rough sketch follows below):
ping the ELB in order to get all the IPs in use
use mtr on every IP found
keep an eye on CloudWatch
and look for clues there.
Solution: if your DNS is configured to point directly at the ELB, you should reduce the TTL of the DNS record (the IP/name association). The ELB's IPs can change at any time, so stale records can do serious damage to your traffic.
Clients keep some of the ELB's IPs cached, which is exactly the kind of trouble you get.
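A rough sketch of the enumeration and TTL check described above, assuming the ELB's DNS name is my-elb-123456.eu-west-1.elb.amazonaws.com (a placeholder for yours):

# List the IPs currently behind the ELB; the second column is the TTL (typically 60s)
dig +noall +answer my-elb-123456.eu-west-1.elb.amazonaws.com

# Then trace the path to each returned IP, watching for loss or stalls
# (203.0.113.10 is a placeholder for one of the IPs dig returned)
mtr --report --report-cycles 20 203.0.113.10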
Scaling Elastic Load Balancers
Once you create an elastic load balancer, you must configure it to accept incoming traffic and route requests to your EC2 instances. These configuration parameters are stored by the controller, and the controller ensures that all of the load balancers are operating with the correct configuration. The controller will also monitor the load balancers and manage the capacity that is used to handle the client requests. It increases capacity by utilizing either larger resources (resources with higher performance characteristics) or more individual resources. The Elastic Load Balancing service will update the Domain Name System (DNS) record of the load balancer when it scales so that the new resources have their respective IP addresses registered in DNS. The DNS record that is created includes a Time-to-Live (TTL) setting of 60 seconds, with the expectation that clients will re-lookup the DNS at least every 60 seconds. By default, Elastic Load Balancing will return multiple IP addresses when clients perform a DNS resolution, with the records being randomly ordered on each DNS resolution request. As the traffic profile changes, the controller service will scale the load balancers to handle more requests, scaling equally in all Availability Zones.
Best Practices ELB on AWS
An ALB load balancer needs 2 Availability Zones. If you use a private/public/NAT VPC setup, then all of the public subnets must have a connection to the Internet.
For me the issue was that the ALB was pointing to an Nginx instance, which had a misconfigured DNS resolver. This meant that Nginx tried to use the resolver, timed out, and then actually started working a bit later.
Not really closely connected with the load balancer itself, but maybe it helps someone figure out the issue in their own setup.
Check the security groups too. That was the issue in my case.
I see a similar problem in my Chrome logs (1.3 min lag). It happens on an OPTIONS request, and from Wireshark I don't even see the request leaving the PC in the first place. Any suggestions as to what Chrome might be doing are welcome.
We recently encountered Chrome taking 1.3 minutes to load pages, but the cause was slightly different. Just popping it here in case it helps someone.
1.3 minutes seems to be how long Chrome will wait when trying to connect to a specific IP. Our domain name has multiple IP addresses in its A record (similar to a CNAME setup), and one of those IPs belonged to a server that had crashed. So sometimes the browser would connect quickly because it used a valid IP, and sometimes we would get the long wait as it tried to connect to the invalid IP, timed out, and then retried with a valid IP.
So it is worth checking that all the IPs listed when you dig your domain are resolving and responding correctly.
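One quick way to do that check (example.com and 203.0.113.10 are placeholders for your domain and one of its A-record IPs):

# List every A record for the domain
dig +short example.com A

# Probe each returned IP directly with a short connect timeout;
# a healthy IP answers quickly, a dead one runs into the 5-second timeout
curl -sv -o /dev/null --connect-timeout 5 \
  --resolve example.com:443:203.0.113.10 https://example.com/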

Google VM Instance becomes unhealthy on its own

I have been using Google Cloud for quite some time and everything worked fine. I was using a single VM instance to host both the website and the MySQL database.
Recently, I decided to move the website to autoscaling so that on days when the traffic increases, the website doesn't go down.
So, I moved the database to Cloud SQL and created a VM group which hosts the PHP, HTML, and image files. Then I set up a load balancer to divert traffic to the various VM instances in the VM group.
The problem is that the backend service (the VM group behind the load balancer) becomes unhealthy on its own after working fine for 5-6 hours, and then becomes healthy again after 10-15 minutes. I have also seen that the problem can appear when I run a fairly lengthy script with many MySQL queries.
I checked the health check and it was giving a 200 response. During the down period of 10-15 minutes, the VM instance is still accessible from its own IP address.
Everything else is the same; I have just added a load balancer in front of the VM instance, and the problem has started.
Can anybody help me troubleshoot this problem?
It sounds like your server is timing out (blocking?) on the health check during the times the load balancer reports it as down. A few things you can check:
The logs (I'm presuming you're using Apache?) should include a duration along with the request status. The default health check timeout is 5s, so if your health check returns a 200 in 6s, the health checker will time out after 5s and treat the host as down.
You mention that a heavy MySQL load can cause the problem. Have you looked at disk I/O statistics and CPU usage to make sure that this isn't a load-related problem? If it is CPU or I/O related, you might look at increasing CPU or disk size, or moving your disk from spindle-backed to SSD-backed storage.
Have you checked that you have sufficient threads available? Ideally, your health check would run fairly quickly, but it might be delayed (for example) if you have 3 threads and all three are busy running some other PHP script that's waiting on the database.
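A quick way to see how long the health check path actually takes from the instance itself (the /health path is a placeholder for whatever URL your health check requests):

# Print the status code and total time every 10s; anything creeping toward 5s will be treated as a failure
while true; do
  curl -o /dev/null -s -w "%{http_code} %{time_total}s\n" http://localhost/health
  sleep 10
done

Running this while the lengthy MySQL script executes should show whether those are the moments the response time creeps past the 5s timeout.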

mySQL "Too many connections" error influenced by number of mongrel instances?

Recently I have started getting MySQL "too many connections" errors at times of high traffic. My Rails app runs on a mongrel cluster with 2 instances on a shared host. Some recent changes that might be driving it:
Traffic to my site has increased. I am now averaging about 4K pages a day.
Database size has increased. My largest table has ~100K rows.
Some associations could return several hundred instances in the worst case, though most are far less.
I have added some features that increased the number and size of database calls in some actions.
I have done a code review to reduce database calls, optimize SQL queries, add missing indexes, and use :include for eager loading. However, many of my methods still make 5-10 separate SQL calls. Most of my actions have a response time of around 100ms, but one of my most common actions averages 300-400ms, and some actions randomly peak at over 1000ms.
The logs are of little help, as the errors seem to occur randomly, or at least the pattern does not appear related to the actions being called or data being accessed.
Could I alleviate the error by adding additional mongrel instances? Or are the mySQL connections limited by the server, and thus unrelated to the number of processes I divide my traffic across?
Is this most likely a problem with my coding, or should I be pressing my host for more capacity/less load on the shared server?
ActiveRecord has pooled database connections since Rails 2.2, and it's likely that that's what's causing your excess connections here. Try turning down the value of pool in your database.yml for that environment (it defaults to 5).
Docs can be found here.
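For reference, a minimal sketch of that change, assuming a typical Rails database.yml and that the environment in question is production (credentials omitted; the value 2 is just an illustration, since each single-threaded mongrel only needs one connection at a time):

production:
  adapter: mysql
  pool: 2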
Are you caching anything? It's an important part of alleviating application and database load. The Rails Guides have a section on caching.
Something is wrong. A Mongrel instance processes 1 request at a time, so if you have 2 Mongrel instances then you should not be seeing more than 2 active MySQL connections (from the mongrels, at least).
You could log or graph the output of SHOW STATUS LIKE 'Threads_connected' over time.
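One low-tech way to do that logging (the MySQL user and password are placeholders):

# Append a timestamped connection count every 30 seconds
while true; do
  echo "$(date '+%F %T') $(mysql -N -u monitor -psecret -e "SHOW STATUS LIKE 'Threads_connected'")" >> threads_connected.log
  sleep 30
done

If the count climbs well past the number of mongrels times their pool size, something other than the mongrels is holding connections open.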
PS: this is not very many Mongrels. If you want to be able to service more than 2 simultaneous requests then you'll want more. And if memory is tight, you can switch to Phusion Passenger and REE.