SSH connection times out after rebooting a GCE instance - google-compute-engine

Can anyone tell me why, after rebooting a Google Compute Engine instance, I get an SSH connection timeout? I have rebooted the instance both with sudo reboot and from the Google Compute Engine Console, and the result is the same.

When the OS shuts down to reboot, all network connections are closed, including SSH connections. From the client side, this can look like a connection time out.

When you use gcutil resetinstance, it does the same thing as pushing the power button on a physical host. This is different from e.g. sudo reboot, because the former does not give the operating system a chance to perform any shutdown (like closing open sockets, flushing buffers, etc), while the latter does an orderly shutdown.
You should probably prefer logging in to the instance to do a reboot rather than using gcutil resetinstance if the host is still ssh-able; resetinstance (or the "Reboot Instance" button in the GUI) is a hard reset, which allows you to recover from a kernel crash or SSH failing.
In more detail:
During an OS-initiated reboot (like sudo reboot), the operating system performs a number of cleanup steps and then moves to runlevel 6 (reboot). This runs the shutdown scripts in /etc/init.d, followed by a graceful shutdown. During a graceful shutdown, sshd will be killed; sshd could catch the kill signal to close all of its open sockets. Closing the socket causes a FIN TCP packet to be sent, starting an orderly TCP teardown ("Connection closed" message in your ssh client). Alternatively, if sshd simply exits, the kernel closes its open TCP sockets on its behalf, which can send a RST (reset) packet and cause a "Connection reset" message in your ssh client. Once all the processes have been shut down, the kernel makes sure that all dirty pages in the page cache are flushed to disk, then uses one of a few mechanisms (ACPI, keyboard controller, or triple fault) to trigger a BIOS reboot.
When you trigger an external reset (e.g. via the resetinstance API call or the GUI), the VM goes immediately to the last step, and the operating system doesn't get a chance to do any of the graceful shutdown steps above. This means your ssh client won't receive a FIN or RST packet as above, and will only notice that the connection is gone when the remote server stops responding ("Connection timed out").
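The three client-side outcomes described above can be reproduced on a single machine. The following is a minimal sketch, assuming a Linux or macOS host and loopback sockets rather than a real sshd; the function name observe_close and the three mode labels are made up for illustration. An orderly close sends FIN, a close with SO_LINGER set to zero sends RST, and a peer that simply goes silent (like a hard-reset VM) leaves the client waiting until its own timeout fires.

```python
import socket
import struct


def observe_close(mode: str) -> str:
    """Return what a client observes when the server side ends in `mode`.

    mode is one of "fin", "rst" or "silent" (labels invented for this sketch).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=0.5)
    server_side, _ = listener.accept()
    try:
        if mode == "fin":
            # Orderly close: the kernel sends FIN.
            server_side.close()
        elif mode == "rst":
            # SO_LINGER enabled with a zero timeout makes close() send RST.
            # The struct layout (two ints) is an assumption valid on Linux/macOS.
            server_side.setsockopt(
                socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
            )
            server_side.close()
        # "silent": the server never closes or answers, like a hard-reset VM.
        try:
            data = client.recv(1024)
        except ConnectionResetError:
            return "connection reset"
        except socket.timeout:
            return "timed out"
        return "connection closed" if data == b"" else "data"
    finally:
        client.close()
        server_side.close()
        listener.close()


if __name__ == "__main__":
    for mode in ("fin", "rst", "silent"):
        print(mode, "->", observe_close(mode))
```

In the "silent" case nothing ever tells the client the peer is gone, which is why the real ssh client only reports a timeout after its own keepalive or TCP retransmission limits are reached.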

Thank you Brian Dorsey, E. Anderson and vgt for answering my question. The problem turned out to be something else. Each time, before resetting the instance, I had brought up an Ethernet bridge with the bridge-utils package between the "eth0" interface and a new bridge interface called "br0". After resetting the instance with sudo reboot or from the GCE Console, the SSH connection stopped working.
But if I don't bring up the Ethernet bridge, the instance restarts fine with both methods.

If your instance image is CentOS, try to remove selinux.
sudo yum remove selinux*

Slightly orthogonal to Brian's answer. To gracefully reboot a GCE VM you can use:
gcutil resetinstance <instancename>

Related

Django does not gracefully close MySQL connection upon shutdown when run through uWSGI

I have Django 2.2.6 running under uWSGI 2.0.18, with 6 pre-forking worker processes. Each of these has its own MySQL connection socket (600 second CONN_MAX_AGE). Everything works fine, but when a worker process is recycled or shut down, the uwsgi log ironically says:
Gracefully killing worker 6 (pid: 101)...
But MySQL says:
2020-10-22T10:15:35.923061Z 8 [Note] Aborted connection 8 to db: 'xxxx' user: 'xxxx' host: '172.22.0.5' (Got an error reading communication packets)
It doesn't hurt anything but the MySQL error log gets spammed full of these as I let uWSGI recycle the workers every 10 minutes and I have multiple servers.
It would be good if Django could catch the uWSGI worker process "graceful shutdown" and close the mysql socket before dying. Maybe it does and I'm configuring this setup wrong. Maybe it can't. I'll dig in myself but thought I'd ask as well..
If CONN_MAX_AGE is set to a positive value, Django creates persistent connections, which are cleaned up at the start and end of each request. Cleaned up here means they are closed if they are invalid, have had too many errors, or were opened more than CONN_MAX_AGE seconds ago.
Otherwise, connections are closed at the end of each request. So this problem occurs by design when you use persistent connections together with periodic uWSGI reloads.
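To make the clean-up rule concrete, here is a simplified model of the decision described above. It is only a sketch and not Django's actual code: PooledConnection and should_close are hypothetical names, and the fields are stand-ins for whatever state the real database wrapper tracks.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PooledConnection:
    """Hypothetical stand-in for one worker's persistent DB connection."""

    max_age: Optional[float]  # seconds, mirrors CONN_MAX_AGE (None = unlimited)
    opened_at: float = field(default_factory=time.monotonic)
    errors_occurred: bool = False
    usable: bool = True


def should_close(conn: PooledConnection, now: Optional[float] = None) -> bool:
    """Decide at request start/end whether the connection must be closed."""
    now = time.monotonic() if now is None else now
    if conn.errors_occurred or not conn.usable:
        return True  # invalid or errored connections are never reused
    if conn.max_age is None:
        return False  # unlimited lifetime
    # With max_age == 0 this is always true: close after every request.
    return now - conn.opened_at >= conn.max_age
```

Note that this check only runs around requests. A worker that is recycled between requests never reaches it, which matches the aborted connections seen in the MySQL log.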
There is this bit of code that instructs uwsgi to shut down all sockets, but I'm unsure whether this is communicated to Django or whether uwsgi uses a more brutal method that causes the aborts. It shuts down all uwsgi-owned sockets, so from the looks of it, Unix sockets and connections with the web server. There's also no hook that is called just before or during a reload.
Perhaps this gets you on your way. :)

Asterisk Realtime Crashing on load when using HAProxy to Galera Cluster

It works fine under light load on our test bench, but once we move it to production the whole thing crashes and we are unable to get Asterisk to function correctly. It is almost as if there is a lag or delay in accessing the MariaDB cluster.
Our architecture and configs are below:
Asterisk 13 Realtime with HAProxy(1.5.18) --> 6 x MariaDB(10.4.11) on independent Datacentres with Galera syncing them (1 only as backup)
Galera Sync is working fine and other services are able to read/write via the HAProxy 100%
It only seems to become an issue when we add load, reload the dialplan, restart Asterisk, etc.
[haproxy.cfg]
global
    user haproxy
    group haproxy
defaults
    mode http
    log global
    retries 2
    timeout connect 3000ms
    timeout server 10h
    timeout client 10h
listen stats
    bind *:8404
    stats enable
    stats hide-version
    stats uri /stats
listen mysql-cluster
    bind 127.0.0.1:3306
    mode tcp
    option mysql-check user haproxy_check
    balance roundrobin
    server mysql_server1 10.0.0.1:3306 check
    server mysql_server2 10.0.0.2:3306 check
    server mysql_server3 10.0.0.3:3306 check
    server mysql_server4 10.0.0.4:3306 check
    server mysql_server5 10.0.0.5:3306 check
    server mysql_server6 10.0.0.6:3306 check backup
What we would really like to know is, firstly, whether Asterisk 13 Realtime will work via HAProxy and, if so, whether there are config changes we need to make to get it working.
Can provide more info if required.
Try using Realtime->ODBC->haproxy.
If that does not help, use debugging, for example gdb traces.
There is no way to determine what issue you have from this. More logs and configs are needed.

Google Compute Engine: Internal DNS server and issues with the resolving

Since Google Compute Engine does not provide internal DNS, I created 2 CentOS BIND machines which do the resolving for the machines on GCE and forward lookups over the VPN to my private cloud, and vice versa.
As the Google Cloud help docs suggest, this kind of scenario is possible: you edit resolv.conf on each instance to do the resolving.
What I did was edit ifcfg-eth0 to disable PEERDNS, and in /etc/resolv.conf
I added the search domain and, at the top, my 2 nameserver instances.
Now, after an instance gets rebooted, it won't start again because it is looking up the metadata.google.internal domain:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
What is the best practice in this kind of scenario?
Thanks.
I also need the internal DNS to do a poor man's round-robin failover, since GCE does not provide internal load balancers.
As mentioned at https://cloud.google.com/compute/docs/networking:
Each instance's metadata server acts as a DNS server. It stores the DNS entries for all network IP addresses in the local network and calls Google's public DNS server for entries outside the network. You cannot configure this DNS server, but you can set up your own DNS server if you like and configure your instances to use that server instead by editing the /etc/resolv.conf file.
So you should be able to just use 169.254.169.254 for your DNS server. If you need to define external DNS entries, you might like Cloud DNS. If you set up a domain with Cloud DNS, or any other DNS provider, the 169.254.169.254 resolver should find it.
If you need something more complex, such as custom internal DNS names, then your own BIND server might be the best solution. Just make sure that metadata.google.internal. resolves to 169.254.169.254.
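One way to check that last requirement from an instance that uses the custom resolver is a small lookup test. This is a minimal sketch; the function name metadata_resolves is made up, and the resolver is passed in as a parameter only so the logic can be exercised without real DNS.

```python
import socket
from typing import Callable

METADATA_NAME = "metadata.google.internal"
METADATA_IP = "169.254.169.254"


def metadata_resolves(resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True if the metadata hostname resolves to the expected address."""
    try:
        return resolve(METADATA_NAME) == METADATA_IP
    except OSError:  # socket.gaierror (lookup failure) is a subclass of OSError
        return False


if __name__ == "__main__":
    print("metadata server resolvable:", metadata_resolves())
```

If this prints False on an instance, the boot scripts that wait for the metadata server would be expected to keep retrying.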
OK, I just ran into this, but unfortunately no timeout kicked in after 30 minutes to get it working. Fortunately nelasx had correctly diagnosed it and given the fix. I'm adding this to give the steps I had to take based on his excellent question and commented answer. I've just pulled the info I had to gather together in one place, to get to a solution.
Symptoms: on startup of the Google instance, connections are refused.
After inspecting the serial console output, you will see:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
You could try waiting, but that didn't work for me, and inspecting https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/google-startup-scripts/usr/share/google/onboot
# Failed to resolve host or connect to host. Retry indefinitely.
6|7) sleep 1.0
log "Waiting for metadata server, attempt ${count}"
led me to believe that waiting will not work.
So, the solution was to fiddle with the disk, to add in nelasx's solution:
"edit ifcfg-eth and change PEERDNS=no; edit /etc/resolv.conf and put on top your nameservers + search domain; edit /etc/hosts and add: 169.254.169.254 metadata.google.internal"
To do this:
Best to create a snapshot backup before you start, in case it goes awry
Uncheck "Delete boot disk when instance is deleted" for your instance
Delete the instance
Create a micro instance with the old boot disk attached as an additional disk
Mount the disk
sudo ls -l /dev/disk/by-id/* # this will give you the names of the attached disks
sudo mkdir /mnt/new
sudo mount /dev/disk/by-id/scsi-0Google_PersistentDisk_instance-1-part1 /mnt/new
where instance-1 will be changed as per your setup
Go in and edit as per nelasx's solution. An idiot trap I fell for: edit the files under the mount point. Don't just sudo vi /etc/hosts; use /mnt/new/etc/hosts. That cost me 15 more minutes as I had to go through the got depressed, scratched head, kicked myself cycle.
Delete the debug instance, ensuring your attached disk delete option is unchecked
Create a new instance matching your original with the edited disk as your boot disk and fire it up.
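Before detaching the disk, it may be worth double-checking that the edited hosts file really contains the metadata entry. The following is a small sketch of such a check; the helper name hosts_maps is made up, and it only does plain text parsing of hosts-file syntax.

```python
def hosts_maps(hosts_text: str, ip: str, name: str) -> bool:
    """Return True if hosts-file text maps `name` to `ip`."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into "address name [alias ...]".
        entry = line.split("#", 1)[0].split()
        if len(entry) >= 2 and entry[0] == ip and name in entry[1:]:
            return True
    return False
```

On the debug instance this could be called with the contents of /mnt/new/etc/hosts (the mounted disk, not the debug instance's own /etc/hosts), the address 169.254.169.254 and the name metadata.google.internal.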

Keeping active background processes in google compute engine

I am running several instances of Ubuntu on Google Cloud.
I create SSH tunnels to each instance with this command, once for every mini-server I have:
gcloud compute ssh --ssh-flag=-vvv "mini-server-1" --zone="us-central1-f" --ssh-flag="-D:5551" --ssh-flag="-N" --ssh-flag="-n" --ssh-flag="-4" --ssh-flag="-o" --ssh-flag="ServerAliveInterval=5" --ssh-flag="-o" --ssh-flag="ServerAliveCountMax=100000" &
Everything works fine; I even added a cron job that checks every 10 minutes whether a connection has timed out and restarts it. But when I log out, it seems every tunnel dies. The script restarts the connections (I can see that in the log), but when I log back in, ps -af | grep ssh shows nothing.
Is there a way to make permanent tunnels that won't die upon logout?
just exit with Ctrl + D and the process won't die.
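Another approach, not from the answers here, is to start each tunnel in its own session so it is no longer tied to the login terminal. The sketch below assumes a POSIX system and reuses the gcloud flags from the question (minus -vvv); tunnel_command and start_detached are hypothetical helper names. Whether a detached process survives logout can still depend on system settings, so treat this as something to try rather than a guarantee.

```python
import subprocess
from typing import List


def tunnel_command(instance: str, zone: str, socks_port: int) -> List[str]:
    """Build the gcloud command from the question for one instance."""
    ssh_flags = [
        "-D:%d" % socks_port, "-N", "-n", "-4",
        "-o", "ServerAliveInterval=5",
        "-o", "ServerAliveCountMax=100000",
    ]
    cmd = ["gcloud", "compute", "ssh", instance, "--zone=" + zone]
    cmd += ["--ssh-flag=" + flag for flag in ssh_flags]
    return cmd


def start_detached(cmd: List[str]) -> subprocess.Popen:
    """Start cmd in a new session, detached from the controlling terminal."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child calls setsid(), so no terminal hangup
    )
```

A tunnel for the first server would then be started with start_detached(tunnel_command("mini-server-1", "us-central1-f", 5551)).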

Site to site OpenSWAN VPN tunnel issues with AWS

We have a VPN tunnel with Openswan between two AWS regions and our colo facility (Used AWS’s guide: http://aws.amazon.com/articles/5472675506466066). Regular usage works OK (ssh, etc), but we are having some MySQL issues over the tunnel between all areas. Using mysql command line client on a linux server and trying to connect using the MySQL Connector J it basically stalls… it seems to open the connection, but then gets stuck. It doesn't get denied or anything, just hangs there.
After initial research I thought this was an MTU issue, but I've experimented with that a lot with no luck.
Connection to the server works fine, and we can choose a database to use and such, but using the Java connector it appears that the Java client isn't receiving any network traffic after the query is made.
When running a select in the MySQL client on linux we can get a max of 2 or 3 rows before it goes dead.
With this said, I also have a separate openswan VPN on the AWS side for client (mac and iOS) vpn connections. Everything works fantastically through the client VPN and it seems more stable in general. The main difference I've noticed is that the static connection is using "tunnel" as the type and the client is using "transport", but when switching the static tunnel connection to transport it says there's like 30 open connections and doesn't work.
I'm very new to OpenSWAN, so hoping someone can help to point me in the right direction of getting the static tunnel working as well as the client VPN.
As always, here's my config files:
ipsec.conf for BOTH static tunnel servers:
# basic configuration
config setup
    # Debug-logging controls: "none" for (almost) none, "all" for lots.
    # klipsdebug=none
    # plutodebug="control parsing"
    # For Red Hat Enterprise Linux and Fedora, leave protostack=netkey
    protostack=netkey
    nat_traversal=yes
    virtual_private=
    oe=off
    # Enable this if you see "failed to find any available worker"
    # nhelpers=0
#You may put your configuration (.conf) file in the "/etc/ipsec.d/" and uncomment this.
include /etc/ipsec.d/*.conf
VPC1-to-colo tunnel conf
conn vpc1-to-DT
    type=tunnel
    authby=secret
    left=%defaultroute
    leftid=54.213.24.xxx
    leftnexthop=%defaultroute
    leftsubnet=10.1.4.0/24
    right=72.26.103.xxx
    rightsubnet=10.1.2.0/23
    pfs=yes
    auto=start
colo-to-VPC1 tunnel conf
conn DT-to-vpc1
    type=tunnel
    authby=secret
    left=%defaultroute
    leftid=72.26.103.xxx
    leftnexthop=%defaultroute
    leftsubnet=10.1.2.0/23
    right=54.213.24.xxx
    rightsubnet=10.1.4.0/24
    pfs=yes
    auto=start
Client point VPN ipsec.conf
# basic configuration
config setup
    interfaces=%defaultroute
    klipsdebug=none
    nat_traversal=yes
    nhelpers=0
    oe=off
    plutodebug=none
    plutostderrlog=/var/log/pluto.log
    protostack=netkey
    virtual_private=%v4:10.1.4.0/24
conn L2TP-PSK
    authby=secret
    pfs=no
    auto=add
    keyingtries=3
    rekey=no
    type=transport
    forceencaps=yes
    right=%any
    rightsubnet=vhost:%any,%priv
    rightprotoport=17/0
    # Using the magic port of "0" means "any one single port". This is
    # a work around required for Apple OSX clients that use a randomly
    # high port, but propose "0" instead of their port.
    left=%defaultroute
    leftprotoport=17/1701
    # Apple iOS doesn't send delete notify so we need dead peer detection
    # to detect vanishing clients
    dpddelay=10
    dpdtimeout=90
    dpdaction=clear
Found the solution. We needed to add the following iptables rule on both ends:
iptables -t mangle -I POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
With this, along with an MTU of 1400, we're looking very solid.
We had the same issue with a server connecting from the EU region to an RDS instance in the US. This appears to be a known issue with the RDS instances not responding to ICMP which is needed to auto-discover the MTU settings. As a workaround, you'll need to configure a smaller MTU on the instance that is performing the query.
On the server that is making the connection to the RDS instance (not the VPN tunnel instances), run the following command to get a MTU setting of 1422 (which worked for us):
sudo ifconfig eth0 mtu 1422
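For a sense of what these MTU values mean for TCP, the largest segment payload (MSS) that fits in one packet is the MTU minus the IP and TCP headers. The sketch below assumes IPv4 and TCP headers without options (20 bytes each); the function name tcp_mss_for_mtu is made up, and the extra overhead that IPsec encapsulation adds on the tunnel itself is deliberately left out because it varies with the cipher and mode.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options
TCP_HEADER_BYTES = 20   # minimum TCP header, no options


def tcp_mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in a single packet of size `mtu`."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


if __name__ == "__main__":
    for mtu in (1500, 1422, 1400):
        print(mtu, "->", tcp_mss_for_mtu(mtu))
```

With the values from the answers above, an MTU of 1400 gives an MSS of 1360 and an MTU of 1422 gives 1382, compared with 1460 for a standard 1500-byte Ethernet MTU. Clamping the MSS advertised in SYN packets keeps both ends from sending segments too large for the tunnel even when path MTU discovery via ICMP does not work.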