Google Cloud instance can't be accessed via SSH after cloning - google-compute-engine

I'm desperate for help here. I have a compute engine instance that hosts a lot of websites. These are the steps that I took:
Go to Compute Engine > Snapshots and take a snapshot of my instance
Click on the newly created snapshot and click Create Instance.
The new instance has all the configs of the current running instance
Then when I tried to access the new instance via SSH, it wouldn't work. Error message:
"Connection Failed
We are unable to connect to the VM on port 22. Learn more about possible causes of this issue."
Clicking on Learn more takes me to https://cloud.google.com/compute/docs/ssh-in-browser#ssherror, which lists these possible causes (my comments follow each one):
The instance is booting up and sshd is not yet running - Not sure how to check this
The instance is not running sshd - Not sure how to check this either
sshd is listening on a port other than the one you are connecting to - My current instance has SSH running on port 22, so I guess this is fine?
There is no firewall rule allowing SSH access on the port - Again, SSH works on my current instance, so I don't think it's because of the firewall, right?
The firewall rule allowing SSH access is enabled, but is not configured to allow connections from GCP Console services. - Same as above
The instance is shut down - Instance is still running.
Strange thing is if I create a fresh instance from scratch and then do the steps above to clone to a new instance then that new instance can be accessed normally via SSH.
Can anyone show me how to fix this, if possible? Or show me how to view the logs and check what went wrong? I tried to google it but got pretty confused by all the jargon and by where to find particular things. Sorry for the wall of text. Thanks
**Edit #1**: I got technical support from Google. The steps below might help someone else, but they didn't help me: when I reached step 7, I waited forever and never got to the login prompt.
1.) Go to the VM instances page and click on the Instance name of your VM.
2.) Click the Edit button at the top of the page.
3.) Under Custom metadata, click Add item.
4.) Set 'Key' to 'startup-script' and set 'Value' to this script:
#! /bin/bash
useradd -G sudo USERNAME
echo 'USERNAME:PASSWORD' | chpasswd
NOTE: change the value of USERNAME and PASSWORD to the name and password of your choice.
5.) Check the "Enable connecting to serial ports" box below the SSH button.
6.) Click Save and then click RESET on the top of the page. Wait for some time for the instance to reboot.
7.) Click on 'Connect to serial port' on the page. In the new window, you might need to wait a bit and press Enter once; then you should see the login prompt.
8.) Login using the USERNAME and PASSWORD you provided.
Note: For the security of your data, please do not share your username or password.
As the steps above didn't help me, and the Google support representative looked at the log but didn't see anything wrong, she suggested debugging SSH by following this guide: https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-ssh#use_your_disk_on_a_new_instance, which I will do when I have time. I feel like I'm writing an essay. Will keep you posted.

The troubleshooting steps that you can follow are:
Use the serial console to view your instance logs and check whether the new instance you created from the snapshot failed to reach the run level at which the SSH daemon is started. If sshd was not started, you will not have SSH access to your instance.
You can try restarting the instance, if that doesn't affect production, and then try to gain SSH access again. It may be that some issue prevented the instance from starting up properly, and restarting it could fix that.
You can try creating another VM instance from the snapshot in case the previous instance wasn’t created properly.
If creating a new VM instance from the snapshot doesn’t fix the issue, it might be that the snapshot itself wasn’t created properly. You can read this documentation guide, section Understanding snapshot best practices, and try creating another snapshot and VM instances from it.
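As a rough sketch of the first step, the serial console output can also be pulled from the command line and filtered for lines that hint at why sshd never started. The function name `filter_boot_problems` and its keyword list are assumptions made for illustration, not an official or exhaustive list; `INSTANCE_NAME` and `ZONE` are placeholders.

```shell
#!/bin/bash
# Sketch: filter serial console text for lines that often accompany a failed boot.
# The keyword list below is an illustrative assumption, not an official one.
filter_boot_problems() {
  # Reads log text on stdin and prints matching lines (case-insensitive).
  # "|| true" keeps the function from failing when nothing matches.
  grep -Ei 'sshd|emergency mode|failed to mount|dependency failed' || true
}

# Example use (requires the gcloud CLI; INSTANCE_NAME and ZONE are placeholders):
#   gcloud compute instances get-serial-port-output INSTANCE_NAME --zone ZONE | filter_boot_problems
```

If the filter prints nothing, reading the unfiltered output from the last boot is still worthwhile, since the relevant message may use different wording.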

I had the same problem and after a lot of searching, I found an answer from user Peripheral from ServerFault that worked for me.
I found the fix for me. A recent update has a known issue where it removes the default gateway from the iptables. To fix it, I have to go to the instance and select Edit. Scroll down, and under Custom Metadata put the following:
key: startup-script
value: route add default gw <gatewayIP> eth0
Save and restart the VM.
Source
All credit goes to them; I just want to share this to help others find their solution faster.
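Before adding the startup script, it may be worth confirming that a missing default route really is the problem. The sketch below assumes you can get a shell on the VM some other way (for example through the serial console) and that the `ip` tool is installed; the function name `has_default_route` is made up for this example.

```shell
#!/bin/bash
# Sketch: succeed if routing-table text on stdin contains a default route.
# Expects input in the format printed by `ip route show`.
has_default_route() {
  grep -q '^default '
}

# Example use on the VM (e.g. from the serial console):
#   ip route show | has_default_route || echo "no default route configured"
```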

I had the same issue. I eventually figured out that it was because I had attached a persistent disk and added an entry to the /etc/fstab file. This entry is supposed to automatically mount the attached disk when the instance restarts.
However, when I created a snapshot of the boot disk, I didn't remove the /etc/fstab entry. So any new instance created from this snapshot hits a boot error, because the boot process tries to mount a disk that is not attached.
This information is present in the documentation
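One way to spot this before taking a snapshot is to look for fstab entries that lack the `nofail` mount option, which lets the boot continue when the device is absent. The helper below is only a sketch: the name `fstab_missing_nofail` is hypothetical, and it assumes the usual whitespace-separated fstab layout with the mount options in the fourth column.

```shell
#!/bin/bash
# Sketch: list fstab entries that could block boot if their device is missing.
# Prints "<device> <mountpoint>" for every non-root, non-comment entry whose
# options field (4th column) does not include "nofail".
fstab_missing_nofail() {
  awk '
    $1 ~ /^#/ || NF < 4 { next }   # skip comments and blank or short lines
    $2 == "/"           { next }   # the root filesystem must always mount
    ("," $4 ",") !~ /,nofail,/ { print $1, $2 }
  ' "$1"
}

# Example use on the source instance before snapshotting:
#   fstab_missing_nofail /etc/fstab
```

Any entry it prints either needs `nofail` added or needs to be removed from the copy of /etc/fstab that ends up in the snapshot.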

Related

Internal 500 error on Google Compute Engine, installing littlest jupyter

"Internal 500 server error" after VM runs for a day or two.
This is the second time it has happened. I start the instance and install The Littlest JupyterHub
(see details below). I can log in via the external IP for a day, but then it stops
with an internal 500 error. I cannot SSH or otherwise get into the instance; the only alternative is to
create a new instance and redo everything. What is the problem?
I installed The Littlest JupyterHub on this instance using:
#!/bin/bash
curl https://raw.githubusercontent.com/jupyterhub/the-littlest-jupyterhub/master/bootstrap/bootstrap.py | sudo python3 - --admin master
I would recommend you enable access on your instance to the serial console [1].
You will also need to set up a password for your user following this documentation [2].
With these two steps done, you should be able to reconnect to your instance once you are locked out like you mentioned by following this [3].
You should then be able to investigate what is going on in the instance.
Then try to verify if your application is still running, if the SSH server is still running etc.
Frederic
[1] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#enable_instance_access
[2] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#setting_up_a_local_password
[3] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#connectserialconsole
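For those who prefer the command line, steps [1] and [3] can be sketched with the gcloud CLI. The function below only prints the commands so they can be reviewed before running; its name, `serial_console_commands`, and the example instance and zone are placeholders invented for this sketch. Step [2], setting a local password, still has to be done inside the VM while you have access.

```shell
#!/bin/bash
# Sketch: print (not run) gcloud commands for serial console access.
# Arguments: instance name, zone. Both are supplied by the caller.
serial_console_commands() {
  local instance="$1"
  local zone="$2"
  # [1] Enable interactive serial console access on the instance.
  echo "gcloud compute instances add-metadata $instance --zone $zone --metadata serial-port-enable=TRUE"
  # [3] Open an interactive session on the serial port.
  echo "gcloud compute connect-to-serial-port $instance --zone $zone"
}

# Placeholder values; replace with your own instance name and zone.
serial_console_commands my-instance us-central1-a
```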

How do I resolve this error when trying to connect to an SQL server hosted on a Google Compute Engine Ubuntu VM

For a database course that I'm in, the professor has tasked us with setting up several VM MySQL servers and remote connections. I've found proper documentation to solve most of my problems, but I've pored over docs trying to find a solution to my latest issue.
I've set up an Ubuntu VM on the Google Cloud Compute Engine. I installed a MySQL server to this VM instance, and I need to log in remotely from my laptop. I've followed this documentation https://cloud.google.com/solutions/mysql-remote-access and this youtube video https://www.youtube.com/watch?v=f5qQDm3ciDg.
However, I still get an Unable to Connect to Server message when I test my connection. What could I be overlooking that will help me connect?
Thanks!
So, I slammed my head against a wall for long enough to realize that ssh will be an easier solution than a direct connection.
So, at least for my Windows machine, these are the steps I followed to make the connection:
Install the MySQL server on the VM (you don't need to add a user unless necessary, and you don't need to change the bind-address in the config file).
Use PuTTYgen to create a private/public key pair. Export the private key in OpenSSH format (in the export options).
Click the Edit button on your VM instance, then scroll down to the SSH Keys section.
Paste the public key into the text box (be sure to change the last comment portion to a username on the Linux VM).
Use the SSH connection on MySQL Workbench. Use the external IP of your VM as the first (ssh) host name and localhost as the second (SQL) host name. Input all other info as it is asked for.
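What MySQL Workbench does with these settings is roughly an SSH local port forward. The sketch below prints the equivalent `ssh` command for anyone using a plain client instead. It assumes MySQL listens on its default port 3306 on the VM; the function name, key path, user name, and IP address are all placeholders invented for this example.

```shell
#!/bin/bash
# Sketch: print the ssh command for a local port forward to MySQL on the VM.
# Arguments: private key path, VM username, VM external IP, local port.
mysql_tunnel_command() {
  local key="$1"
  local user="$2"
  local host="$3"
  local local_port="$4"
  # -N: run no remote command; -L: forward local_port to port 3306 on the VM itself.
  echo "ssh -i $key -N -L $local_port:localhost:3306 $user@$host"
}

# Placeholder values (203.0.113.10 is a documentation-only address).
mysql_tunnel_command "$HOME/.ssh/gce_key" student 203.0.113.10 3307
```

With the tunnel open, a client on the laptop would connect to 127.0.0.1 on the chosen local port (3307 here), for example `mysql -h 127.0.0.1 -P 3307 -u root -p`. Because traffic arrives at MySQL from localhost on the VM, the bind-address can stay at its default.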

Google Compute Engine: Internal DNS server and issues with the resolving

Since Google Compute Engine does not provide internal DNS, I created two CentOS BIND machines that do the resolving for the machines on GCE and forward queries over VPN to my private cloud and vice versa.
As the Google Cloud help docs suggest, this kind of scenario is possible: you edit resolv.conf on each instance to do the resolving.
What I did was edit ifcfg-eth0 to disable PEERDNS, and in /etc/resolv.conf
I added the search domain and, at the top, my two nameserver instances.
Now, after an instance is rebooted, it won't start again because it is searching for the metadata.google.internal domain:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
What is the best practice in this kind of scenario?
Thanks.
I also need the internal DNS for a poor man's round-robin failover, since GCE does not provide internal load balancers.
As mentioned at https://cloud.google.com/compute/docs/networking:
Each instance's metadata server acts as a DNS server. It stores the DNS entries for all network IP addresses in the local network and calls Google's public DNS server for entries outside the network. You cannot configure this DNS server, but you can set up your own DNS server if you like and configure your instances to use that server instead by editing the /etc/resolv.conf file.
So you should be able to just use 169.254.169.254 for your DNS server. If you need to define external DNS entries, you might like Cloud DNS. If you set up a domain with Cloud DNS, or any other DNS provider, the 169.254.169.254 resolver should find it.
If you need something more complex, such as custom internal DNS names, then your own BIND server might be the best solution. Just make sure that metadata.google.internal. resolves to 169.254.169.254.
OK, I just ran into this, but unfortunately there was no timeout after 30 minutes that got it working. Fortunately nelasx had correctly diagnosed it and given the fix. I'm adding this answer to give the steps I had to take, based on his excellent question and commented answer; I've just pulled the info I had to gather together in one place.
Symptoms: on startup of the Google instance, connections are refused.
Inspecting the serial console output shows:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
You could try waiting, but that didn't work for me, and this excerpt from https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/google-startup-scripts/usr/share/google/onboot
# Failed to resolve host or connect to host. Retry indefinitely.
6|7) sleep 1.0
log "Waiting for metadata server, attempt ${count}"
led me to believe that waiting will never work.
So the solution was to fiddle with the disk in order to apply nelasx's solution:
"edit ifcfg-eth and change PEERDNS=no; edit /etc/resolv.conf and put on top your nameservers + search domain; edit /etc/hosts and add: 169.254.169.254 metadata.google.internal"
To do this,
Best to create a snapshot backup before you start in case it goes awry
Uncheck "Delete boot disk when instance is deleted" for your instance
Delete the instance
Create a micro instance
Mount the disk
sudo ls -l /dev/disk/by-id/* # this lists the attached disks and their partitions by name
sudo mkdir /mnt/new
sudo mount /dev/disk/by-id/scsi-0Google_PersistentDisk_instance-1-part1 /mnt/new
where instance-1 will be changed as per your setup
Go in and edit as per nelasx's solution. Idiot trap I fell for: edit the files under the mount point. Don't just sudo vi /etc/hosts; use /mnt/new/etc/hosts. That cost me 15 more minutes, as I had to go through the "got depressed, scratched head, kicked myself" cycle.
Delete the debug instance, ensuring your attached disk delete option is unchecked
Create a new instance matching your original with the edited disk as your boot disk and fire it up.
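The /etc/hosts part of the edit can be scripted so that the mount-point mistake mentioned above is harder to make. This is a minimal sketch, assuming the rescued disk is mounted at /mnt/new as in the steps; the function name `add_metadata_host` is invented here, and the ifcfg-eth and resolv.conf edits still need to be made by hand.

```shell
#!/bin/bash
# Sketch: add the metadata server entry to the hosts file of a MOUNTED disk.
# Argument: mount point of the rescued disk (e.g. /mnt/new), not "/".
add_metadata_host() {
  local root="$1"
  local hosts="$root/etc/hosts"
  local entry="169.254.169.254 metadata.google.internal"
  # Append only if the exact entry is not already present (safe to re-run).
  grep -qxF "$entry" "$hosts" 2>/dev/null || echo "$entry" >> "$hosts"
}

# Example use on the debug instance, from a root shell:
#   add_metadata_host /mnt/new
```

Taking the mount point as an explicit argument means the debug instance's own /etc/hosts is only touched if you pass "/" on purpose.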

Cannot connect to Google Compute Instance user

I have searched all over Google for this issue but have found no answer yet, so I thought I would ask here.
I have a Google Compute Engine instance, and I had a PuTTY SSH connection that worked flawlessly. But after I formatted my PC, everything went wrong.
I installed gcloud and went through the whole SSH setup process again (config-ssh, adding the SSH key to the key list, and trying to connect). I was also trying to connect as my old user, after I realized that I had typed a name different from my Windows user name. Suddenly I got the "No supported authentication" message. So I thought something was wrong with the SSH keys, but then I realized that I cannot connect as my user even through the in-browser SSH window: the connection is always stuck on trying to connect until it times out.
I would greatly appreciate any help :)
gcloud compute ssh currently has a known problem and might not work on Windows.
Here's a workaround until we fix it: run "gcloud compute ssh INSTANCE --dry-run". This will output the command it tries to execute.
Copy that command. You can either add -W flag to it and run it, or replace ssh.exe with ssh-term.exe and remove the -o flags.
If gcloud is installed in a place like Program Files, you might also need to put double quotes ("") around the path.
First of all, run the following command (replacing the word in capital letters) which will ensure that your SSH key is created if it was not created before: gcloud compute ssh INSTANCE
Then, follow these steps to add your SSH key to your project and SSH into your instance:
1- Copy the content of C:\Users\<username>\.ssh\google_compute_engine.pub (the path might differ depending on your Windows version) into the project metadata (Developers Console -> PROJECT -> Compute -> Metadata -> SSH keys -> Edit -> Add key).
If you want to log in as a different user, you can do so by modifying the last word of the pasted text: <username>@<hostname>
2- Configure PuTTY. Go to Connection -> SSH -> Auth -> Browse and select your PuTTY SSH key, which should be located at C:\Users\<username>\.ssh\google_compute_engine.ppk, and try to SSH into the instance.
3- If it doesn't work, remove the instance metadata because the instance metadata overrides the project metadata. To do that, go to Compute-> Compute Engine-> INSTANCE -> SSH keys -> Edit -> Click on every ‘x’ and save the changes.
Regarding your issue trying to access using the SSH button in the Developers Console, I’d reboot the instance if it’s not in production because there is a script that must be working properly in order to access from there: /usr/bin/python /usr/share/google/google_daemon/manage_accounts.py --daemon
I hope it helps.

Unable to create indexes in Sphinx after an emergency server restart [Can't create TCP/IP socket]

I'm trying to execute the command in the Windows console:
C:\SphinxSearch\bin\indexer --all --config C:\SphinxSearch\sphinx.conf
But I get an error:
ERROR: index 'indexname': sql_connect: Can't create TCP/IP socket (10093) (DSN=mysql://root:*@localhost:3306/test).
The data source is MySQL. Before the server restart everything worked fine.
How can I fix it?
I'm having the same error 10093. It's a windows error code by the way. In my case it occurs when trying to run the indexer through the system account via a scheduled task. If I'm running it directly as administrator, there's not a problem.
According to the site above:
Either your application hasn't called WSAStartup(), or WSAStartup() failed, or--possibly--you are accessing a socket which the current active task does not own (i.e. you're trying to share a socket between tasks).
In my case I think it might be the last one: some security problem caused by the SYSTEM user being used in my scheduled task. I was able to solve it by using my admin user instead: in the scheduled task, I set it to use my local admin account with the options "Run whether user is logged on or not" and "Do not store password". I also checked "Run with highest privileges". This seems to have done the trick, as my indexes are now rotating on schedule.