Keeping active background processes in Google Compute Engine

I am running several Ubuntu instances on Google Cloud.
I create an SSH tunnel to each instance with this command, once for every mini-server I have:
gcloud compute ssh --ssh-flag=-vvv "mini-server-1" --zone="us-central1-f" --ssh-flag="-D:5551" --ssh-flag="-N" --ssh-flag="-n" --ssh-flag="-4" --ssh-flag="-o" --ssh-flag="ServerAliveInterval=5" --ssh-flag="-o" --ssh-flag="ServerAliveCountMax=100000" &
Everything works fine; I even added a cron job that checks every 10 minutes whether the connection has timed out and restarts it. But when I log out, every tunnel seems to die. The script restarts the connections (I can see that in the log), but when I log back in, ps -af | grep ssh shows nothing.
Is there a way to make permanent tunnels that won't die upon logout?

Just exit the shell with Ctrl + D and the process won't die.
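If the tunnels are started from a script or a cron job rather than by hand, a common way to keep a background process from being hung up on logout is nohup. The following is only a minimal sketch: the helper name start_detached and the log path are made up for illustration and are not part of gcloud, and the commented gcloud call simply reuses the flags from the question.

```shell
#!/bin/sh
# Hypothetical helper (not part of gcloud): run a command immune to SIGHUP,
# with stdin/stdout/stderr detached from the terminal, and print its PID.
start_detached() {
  log="$1"
  shift
  nohup "$@" >"$log" 2>&1 </dev/null &
  echo $!
}

# Example, assuming the instance name, zone and port from the question:
# start_detached /tmp/tunnel-5551.log gcloud compute ssh "mini-server-1" \
#   --zone="us-central1-f" --ssh-flag="-D:5551" --ssh-flag="-N" --ssh-flag="-n"
```

Because the output goes to a log file and stdin is /dev/null, the process holds no reference to the login terminal, which is usually what lets it survive the session ending.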

Related

Internal 500 error on Google Compute Engine, installing littlest jupyter

"Internal 500 server error" after VM runs for a day or two.
This is the second time it has happened. I start the instance and install The Littlest JupyterHub
(see details below). I can log in at the external IP for a day, but then it stops
with an internal 500 error. I cannot SSH into the instance; the only alternative is to
create a new instance and redo everything. What is the problem?
I installed The Littlest JupyterHub on this instance using:
#!/bin/bash
curl https://raw.githubusercontent.com/jupyterhub/the-littlest-jupyterhub/master/bootstrap/bootstrap.py | sudo python3 - --admin master
I would recommend enabling access to the serial console on your instance [1].
You will also need to set up a password for your user by following this documentation [2].
With these two steps done, you should be able to reconnect to your instance when you are locked out, as you described, by following this [3].
You should then be able to investigate what is going on in the instance.
Then try to verify if your application is still running, if the SSH server is still running etc.
Frederic
[1] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#enable_instance_access
[2] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#setting_up_a_local_password
[3] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#connectserialconsole

Google Cloud instance can't be accessed via SSH after cloning

I'm desperate for help here. I have a compute engine instance that hosts a lot of websites. These are the steps that I took:
Go to Compute Engine > Snapshots and take a snapshot of my instance
Click on the newly created snapshot and click Create Instance.
The new instance has all the configs of the current running instance
Then when I tried to access the new instance via SSH, it wouldn't work. Error message:
"Connection Failed
We are unable to connect to the VM on port 22. Learn more about possible causes of this issue."
Clicking on Learn more gets me to https://cloud.google.com/compute/docs/ssh-in-browser#ssherror
The instance is booting up and sshd is not yet running - Not sure how to check this
The instance is not running sshd - Not sure how to check this either
sshd is listening on a port other than the one you are connecting to - My current instance is having ssh running on port 22 so I guess this is fine?
There is no firewall rule allowing SSH access on the port - Again, my current instance is having ssh running so I don't think it's because of firewall, right?
The firewall rule allowing SSH access is enabled, but is not configured to allow connections from GCP Console services. - Same as above
The instance is shut down - Instance is still running.
Strange thing is if I create a fresh instance from scratch and then do the steps above to clone to a new instance then that new instance can be accessed normally via SSH.
Can anyone show me how to fix this, if possible? Or show me how to see the logs and check what went wrong? I tried to google it but got pretty confused by all the jargon and by where to find particular things. Sorry for the wall of text. Thanks.
Edit #1: I got technical support from Google. The steps below might help someone else, but they did not help me: when I reached step 7, I waited forever and couldn't get to the login page.
1.) Go to the VM instances page and click on the Instance name of your VM.
2.) Click the Edit button at the top of the page.
3.) Under Custom metadata, click Add item.
4.) Set 'Key' to 'startup-script' and set 'Value' to this script:
#! /bin/bash
useradd -G sudo USERNAME
echo 'USERNAME:PASSWORD' | chpasswd
NOTE: change the value of USERNAME and PASSWORD to the name and password of your choice.
5.) Enable "Enable connecting to serial ports" by checking the box below the SSH button.
6.) Click Save and then click RESET on the top of the page. Wait for some time for the instance to reboot.
7.) Click on 'Connect to serial port' in the page. In the new window, you might need to wait a bit and press on Enter of your keyboard once; then, you should see the login prompt.
8.) Login using the USERNAME and PASSWORD you provided.
Note: Please do not share any of your password and username for your data security.
As the steps above didn't help me, and the Google support representative looked at the log but didn't see anything wrong, she suggested debugging SSH by following this guide https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-ssh#use_your_disk_on_a_new_instance which I will do when I have time. Feels like I'm writing an essay. Will keep you posted.
The troubleshooting steps that you can follow are:
Use the serial console to view your instance logs and check whether the new instance you created from the snapshot failed to start to the appropriate run level where the ssh daemon would get started. If sshd was not started you would not have ssh access to your instance.
You can try restarting the instance if it doesn’t affect production and try to gain ssh access again. Might be that some issue prevented the instance from starting up properly and restarting it could fix it.
You can try creating another VM instance from the snapshot in case the previous instance wasn’t created properly.
If creating a new VM instance from the snapshot doesn’t fix the issue, it might be that the snapshot itself wasn’t created properly. You can read this documentation guide, section Understanding snapshot best practices, and try creating another snapshot and VM instances from it.
I had the same problem and after a lot of searching, I found an answer from user Peripheral from ServerFault that worked for me.
I found the fix for me. A recent update has a known issue where it removes the default gateway from the iptables. To fix it, I have to go to the instance and select Edit. Scroll down, and under Custom Metadata put the following:
key: startup-script
value: route add default gw <gatewayIP> eth0
Save and restart the VM.
Source
All credits to him/her, just want to share to help others find their solution faster.
I had the same issue. I eventually figured out that it was because I had attached a persistent disk and added an entry to the /etc/fstab file. This entry is supposed to mount the attached disk automatically when the instance restarts.
However, when I created a snapshot of the boot disk, I didn't remove the /etc/fstab entry. So creating a new instance from this snapshot will always cause a boot error, as the system tries to mount a disk that is not attached.
This information is present in the documentation.
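Before taking a snapshot of a boot disk, it can help to list the /etc/fstab entries that point at devices which are not present. This is a rough sketch under stated assumptions: the function name list_missing_fstab_devices is made up for illustration, and it only looks at entries whose first field is a path under /dev (entries using UUID= or LABEL= are ignored).

```shell
#!/bin/sh
# Hypothetical helper: print every /dev/... device named in an fstab file
# that does not exist on this machine. Defaults to /etc/fstab.
list_missing_fstab_devices() {
  fstab="${1:-/etc/fstab}"
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while read -r dev rest || [ -n "$dev" ]; do
    case "$dev" in
      ''|'#'*) continue ;;                    # skip blank lines and comments
      /dev/*) [ -e "$dev" ] || echo "$dev" ;; # report missing device paths
    esac
  done < "$fstab"
}

# Usage: list_missing_fstab_devices            # checks /etc/fstab
#        list_missing_fstab_devices /mnt/new/etc/fstab
```

Adding the nofail mount option to the entry is another common way to let the system boot even when the disk is absent.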

Google Compute Engine: Internal DNS server and issues with the resolving

Since Google Compute Engine does not provide internal DNS, I created two CentOS BIND machines that do the resolving for the machines on GCE and forward lookups over VPN to my private cloud and vice versa.
As the Google Cloud help docs suggest, this kind of scenario is possible: you edit resolv.conf on each instance to do the resolving.
What I did was edit ifcfg-eth0 to disable PEERDNS, and in /etc/resolv.conf
I added the search domain and, at the top, my two nameserver instances.
Now, after an instance is rebooted, it won't start again because it is looking for the metadata.google.internal domain:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
What is the best practice in this kind of scenario?
Thanks.
I also need internal DNS to do a poor man's round-robin failover, since GCE does not provide internal load balancers.
As mentioned at https://cloud.google.com/compute/docs/networking:
Each instance's metadata server acts as a DNS server. It stores the DNS entries for all network IP addresses in the local network and calls Google's public DNS server for entries outside the network. You cannot configure this DNS server, but you can set up your own DNS server if you like and configure your instances to use that server instead by editing the /etc/resolv.conf file.
So you should be able to just use 169.254.169.254 for your DNS server. If you need to define external DNS entries, you might like Cloud DNS. If you set up a domain with Cloud DNS, or any other DNS provider, the 169.254.169.254 resolver should find it.
If you need something more complex, such as custom internal DNS names, then your own BIND server might be the best solution. Just make sure that metadata.google.internal. resolves to 169.254.169.254.
OK, I just ran into this, but unfortunately no timeout kicked in after 30 minutes to get it working. Fortunately nelasx had correctly diagnosed it and given the fix. I'm adding this to give the steps I had to take, based on his excellent question and commented answer. I've just pulled the information I had to gather together into one place, to get to a solution.
Symptoms: on startup of the Google instance, connections are refused.
After inspecting the serial console output, you will see:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
You could try waiting, but that didn't work for me, and inspection of https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/google-startup-scripts/usr/share/google/onboot
# Failed to resolve host or connect to host. Retry indefinitely.
6|7) sleep 1.0
log "Waiting for metadata server, attempt ${count}"
led me to believe that waiting will never work.
So, the solution was to fiddle with the disk, to add in nelasx's solution:
"edit ifcfg-eth and change PEERDNS=no edit /etc/resolv.conf and put on top your nameservers + search domain edit /etc/hosts and add: 169.254.169.254 metadata.google.internal"
To do this,
Best to create a snapshot backup before you start in case it goes awry
Uncheck "Delete boot disk when instance is deleted" for your instance
Delete the instance
Create a micro instance
Mount the disk
sudo ls -l /dev/disk/by-id/* # this will list the attached disks by ID
sudo mkdir /mnt/new
sudo mount /dev/disk/by-id/scsi-0Google_PersistentDisk_instance-1-part1 /mnt/new
where instance-1 will be changed as per your setup
Go in and edit as per nelasx's solution. The idiot trap I fell for: use paths under the mount point. Don't just sudo vi /etc/hosts; use /mnt/new/etc/hosts. That cost me 15 more minutes, as I had to go through the got-depressed, scratched-head, kicked-myself cycle.
Delete the debug instance, ensuring your attached disk delete option is unchecked
Create a new instance matching your original with the edited disk as your boot disk and fire it up.
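The /etc/hosts part of the edit on the mounted disk can be scripted so that the path under the mount point is always used. This is a minimal sketch, assuming the disk is mounted at /mnt/new as in the steps above; the helper name add_metadata_host is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append the metadata server entry to etc/hosts under
# the given root directory (e.g. /mnt/new), unless it is already present.
add_metadata_host() {
  root="$1"
  hosts="$root/etc/hosts"
  entry="169.254.169.254 metadata.google.internal"
  if ! grep -qF "metadata.google.internal" "$hosts" 2>/dev/null; then
    echo "$entry" >> "$hosts"
  fi
}

# Usage on the rescue instance (run as root so the file is writable):
# add_metadata_host /mnt/new
```

Passing the mount point as an argument makes it harder to edit the rescue instance's own /etc/hosts by mistake, and the grep check keeps repeated runs from adding duplicate lines.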

First connect from Prestashop to Google Cloud SQL always fails

I'm setting up a PrestaShop installation on a development server which is a GCE instance and using Cloud SQL as a database server. Everything works just fine except one thing: whenever there is a long period of inactivity on the site, the first page load after that always gives me this error:
Link to database cannot be established: SQLSTATE[HY000] [2003]
If I refresh the page the error is gone and never appears again until I stop using the site for an hour or so. It almost looks like database instance is going into sleep mode or something like that.
The reason I mentioned Prestashop is the fact that I never get this error when using Adminer or connecting to the database from mysql console client.
With the per-use billing model, instances are spun down after a 15-minute timeout to save you money. They then take a few seconds to spin up when next accessed. It may be that PrestaShop is timing out on these first requests (though I have no experience with that application).
Try changing your instance to package billing, which has a 12-hour timeout, to see if this helps:
https://developers.google.com/cloud-sql/faq#how_usage_calculated
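Another way to cope with a slow first connection is to retry it, for example from a warm-up script run before the site is used. The sketch below only illustrates the retry idea in shell: the helper name retry is made up, and the mysql command in the usage comment is a placeholder with hypothetical host and user values, not something taken from the question.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on the first success, 1 if all tries fail.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage (placeholder host and user):
# retry 3 5 mysql --host=DB_HOST --user=DB_USER -e 'SELECT 1'
```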
According to GCE documentation,
Once a connection has been established with an instance, traffic is permitted in both directions over that connection, until the connection times out after 10 minutes of inactivity
I suspect that might be the cause. To get around it, you can try to lower the tcp keepalive time.
Refer here: https://cloud.google.com/sql/docs/compute-engine-access
To keep long-lived unused connections alive, you can set the TCP keepalive. The following commands set the TCP keepalive value to one minute and make the configuration permanent across instance reboots.
# Display the current tcp_keepalive_time value.
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
# Set tcp_keepalive_time to 60 seconds and make it permanent across reboots.
$ echo 'net.ipv4.tcp_keepalive_time = 60' | sudo tee -a /etc/sysctl.conf
# Apply the change.
$ sudo /sbin/sysctl --load=/etc/sysctl.conf
# Display the tcp_keepalive_time value to verify the change was applied.
$ cat /proc/sys/net/ipv4/tcp_keepalive_time

ssh connection time out after reboot GCE instance

Can anyone tell me why I get an SSH connection timeout after rebooting a Google Compute Engine instance? I rebooted the instance both with sudo reboot and from the Google Compute Engine Console, and both give the same result.
When the OS shuts down to reboot, all network connections are closed, including SSH connections. From the client side, this can look like a connection time out.
When you use gcutil resetinstance, it does the same thing as pushing the power button on a physical host. This is different from e.g. sudo reboot, because the former does not give the operating system a chance to perform any shutdown (like closing open sockets, flushing buffers, etc), while the latter does an orderly shutdown.
You should probably prefer logging in to the instance to do a reboot rather than using gcutil resetinstance if the host is still ssh-able; resetinstance (or the "Reboot Instance" button in the GUI) is a hard reset, which allows you to recover from a kernel crash or SSH failing.
In more detail:
During OS-initiated reboot (like sudo reboot), the operating system performs a number of cleanup steps and then moves to runlevel 6 (reboot). This causes all the scripts in /etc/init.d to be run and then a graceful shutdown. During a graceful shutdown, sshd will be killed; sshd could catch the kill signal to close all of its open sockets. Closing the socket will cause a FIN TCP packet to be sent, starting an orderly TCP teardown ("Connection closed" message in your ssh client). Alternatively, if sshd simply exits, the kernel sends a RST (reset) packet on all open TCP sockets, which will cause a "Connection reset" message on your ssh client. Once all the processes have been shut down, the kernel will make sure that all dirty pages in the page cache are flushed to disk, then execute one of two or three mechanisms to trigger a BIOS reboot. (ACPI, keyboard controller, or triple-fault.)
When triggering an external reset (e.g. via the resetinstance API call or GUI), the VM will go immediately to the last step, and the operating system won't have a chance to do any of the graceful shutdown steps above. This means your ssh client won't receive a FIN or RST packet like above, and will only notice the connection closed when the remote server stops responding. ("Connection timed out")
Thank you Brian Dorsey, E. Anderson and vgt for answering my question. The problem was something else. Before each reset I had brought up an Ethernet bridge with the bridge-utils utility between the eth0 interface and a new bridge interface called br0. After resetting the instance with sudo reboot or from the GCE Console, the SSH connection stopped working.
But if I don't bring up the Ethernet bridge, the instance restarts fine with both methods.
If your instance image is CentOS, try to remove selinux.
sudo yum remove selinux*
Slightly orthogonal to Brian's answer. To reboot a GCE VM from outside the instance (a hard reset, as described above) you can use:
gcutil resetinstance <instancename>