Cluster not responding, weird error message - google-compute-engine

My Container Engine cluster has a red exclamation mark next to its name in the Container Engine overview of the Google Cloud console. A tooltip says "The cluster has a problem. Click the cluster name for details." When I click the name I don't get any more information, just the usual summary.
Stackdriver doesn't report anything unusual. No incidents are logged and all pods are marked as healthy, but I can't reach my services.
Trying to get information or logs via kubectl doesn't work:
kubectl cluster-info
Unable to connect to the server: dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
How can I debug this problem? And what does this cryptic message mean anyway?

Are you able to use other kubectl commands such as kubectl get pods?
This sounds like the cluster isn't set up correctly or there's some network issue. Would you also try kubectl config view to see how your cluster is configured? More specifically, look for current-context and clusters fields to see if your cluster is configured as expected.
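A rough sketch of those checks as a script, in case it helps. It assumes kubectl, bash and the coreutils timeout command are installed; the function names are made up for illustration. It prints the current context, reads the API server URL from the active kubeconfig entry, and tries a plain TCP connection to it, which separates "wrong cluster configured" from "network path is blocked".

```shell
#!/bin/sh
# Split an API server URL such as https://203.0.113.10 or https://host:6443/path
# into "host port". Port 443 is assumed when the URL names none.
# (Bracketed IPv6 addresses are not handled in this sketch.)
api_host_port() {
  hp="${1#*://}"      # drop the scheme
  hp="${hp%%/*}"      # drop any path
  case "$hp" in
    *:*) echo "${hp%%:*} ${hp##*:}" ;;
    *)   echo "$hp 443" ;;
  esac
}

# Show which cluster kubectl is pointed at and whether its endpoint accepts TCP.
check_api_server() {
  echo "current context: $(kubectl config current-context)"
  server="$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')"
  echo "API server: $server"
  set -- $(api_host_port "$server")
  if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "TCP connect to $1:$2 succeeded"
  else
    echo "TCP connect to $1:$2 failed or timed out"
  fi
}
```

Call check_api_server from the machine where kubectl times out. If the TCP connect also fails, the problem is before Kubernetes itself (network, firewall, or the master being down).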

In our case it was a billing issue. Someone had mistakenly disabled the billing profile for our project. We re-enabled it, and after 20 to 30 minutes the cluster came back up with no errors.

AWS RDS automatically stopping soon after it is started

I have created an RDS instance on AWS which initially shows the status 'available', but when I use my SQL client to connect to it I receive the error:
Status : Failure -Test failed: IO Error: Connection reset by peer, Authentication lapse 0 ms
Then when I check the status of the RDS online (AWS dashboard) it says 'stopping'.
When I try to start the RDS instance again, its status goes from 'starting' to 'stopping' after a couple of minutes and then eventually to 'stopped'. I can't find anything online about an RDS instance stopping automatically, and I am somewhat of a novice with AWS.
Based on the comments:
The solution was found by checking the CloudTrail event history. A search there showed that StopDBInstance had been issued by the HIPComplianceWorker user.
This probably means that an automation checks newly launched DB instances and verifies that they comply with your company's policies. Your instance could be violating such a policy, so it was automatically stopped.
You would have to contact your admins to find out what kind of RDS instance you are allowed to use.
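The same CloudTrail search can be done from the command line. A minimal sketch, assuming the AWS CLI is installed and configured for the account and region that hold the instance; the function name is only illustrative.

```shell
#!/bin/sh
# List recent StopDBInstance calls recorded by CloudTrail, one per line,
# as "event time, user name". The optional argument caps the number of events.
who_stopped_db() {
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=StopDBInstance \
    --max-results "${1:-10}" \
    --query 'Events[].[EventTime,Username]' \
    --output text
}
```

Running who_stopped_db 5 shows the five most recent stop calls. A user name you don't recognise, like the compliance worker here, points at automation and not at a fault in RDS itself.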

Internal 500 error on Google Compute Engine, installing littlest jupyter

"Internal 500 server error" after the VM runs for a day or two.
This is the second time it has happened. I start the instance and install The Littlest JupyterHub (see details below). I can log in at the external IP for a day, but then it stops with an internal 500 error. I cannot SSH or otherwise get into the instance; the only alternative is to create a new instance and redo everything. What is the problem?
I installed The Littlest JupyterHub on this instance using:
#!/bin/bash
curl https://raw.githubusercontent.com/jupyterhub/the-littlest-jupyterhub/master/bootstrap/bootstrap.py | sudo python3 - --admin master
I would recommend you enable access on your instance to the serial console [1].
You will also need to set up a password for your user by following this documentation [2].
With these two steps done, you should be able to reconnect to your instance when you are locked out as you described, by following this [3].
You should then be able to investigate what is going on in the instance.
Then try to verify if your application is still running, if the SSH server is still running etc.
Frederic
[1] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#enable_instance_access
[2] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#setting_up_a_local_password
[3] https://cloud.google.com/compute/docs/instances/interacting-with-serial-console#connectserialconsole
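For reference, the serial console steps can also be driven with gcloud. This is a sketch, not the documented procedure verbatim: the instance name and zone below are placeholders, and the function names are made up. The local password from [2] still has to be set on the instance before the interactive login will work.

```shell
#!/bin/sh
# Placeholder values; replace with your own instance name and zone.
INSTANCE="${INSTANCE:-my-instance}"
ZONE="${ZONE:-us-central1-a}"

# Allow interactive serial console connections to the instance (step [1]).
enable_serial_console() {
  gcloud compute instances add-metadata "$INSTANCE" --zone "$ZONE" \
    --metadata serial-port-enable=TRUE
}

# Print the boot and system log without logging in.
show_serial_output() {
  gcloud compute instances get-serial-port-output "$INSTANCE" --zone "$ZONE"
}

# Open an interactive session on the serial console (step [3]).
connect_serial_console() {
  gcloud compute connect-to-serial-port "$INSTANCE" --zone "$ZONE"
}
```

show_serial_output is worth trying first, since it needs no login and often already shows why the instance stopped answering.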

Terminating dataproc cluster with termination protection on instances produces red flag on cluster that never leaves; is cluster safe?

I need to give a dataproc cluster protection like one can give an AWS EMR cluster. I saw that VM protection is a thing (but can't find anything about dataproc cluster protection), so I decided to try that out.
I made a dataproc cluster, for every instance of which I turned on deletion protection.
As a test of the safety of this arrangement, I tried to delete the cluster from the command line. As a result, the cluster now has a red flag on it all the time. The message reads:
Invalid resource usage: 'Resource cannot be deleted if it's protected against deletion.'.
My question is this: given the persistent error message, is the cluster still ok? Have I accomplished the cluster protection that I sought? As far as I can tell, everything is still alright, I just wondered if anyone knows more about the state of the management of the cluster in the presence of this scary red exclamation mark.
While your cluster is probably fine, it is now in an error state and cannot be updated or used for submitting jobs through the API.
Dataproc does not currently support delete protection. You can file a feature request here: https://issuetracker.google.com/issues/new?component=187133&template=0

Google Cloud instance can't be accessed via SSH after cloning

I'm desperate for help here. I have a compute engine instance that hosts a lot of websites. These are the steps that I took:
Go to Compute Engine > Snapshots and take a snapshot of my instance
Click on the newly created snapshot and click Create Instance.
The new instance has all the configs of the current running instance
Then when I tried to access the new instance via SSH, it wouldn't work. Error message:
"Connection Failed
We are unable to connect to the VM on port 22. Learn more about possible causes of this issue."
Clicking on Learn more gets me to https://cloud.google.com/compute/docs/ssh-in-browser#ssherror
The instance is booting up and sshd is not yet running - Not sure how to check this
The instance is not running sshd - Not sure how to check this either
sshd is listening on a port other than the one you are connecting to - My current instance has SSH running on port 22, so I guess this is fine?
There is no firewall rule allowing SSH access on the port - Again, my current instance has SSH working, so I don't think it's the firewall, right?
The firewall rule allowing SSH access is enabled, but is not configured to allow connections from GCP Console services. - Same as above
The instance is shut down - Instance is still running.
Strange thing is if I create a fresh instance from scratch and then do the steps above to clone to a new instance then that new instance can be accessed normally via SSH.
Can anyone show me how to fix this, if possible? Or show me how to view logs and check what went wrong? I tried to google it, but I'm pretty confused by all the jargon and by where to find particular things. Sorry for the wall of text. Thanks
Edit #1: I got technical support from Google. The steps below might help someone else, but they didn't help me: when I reached step 7, I waited forever and never got to the login prompt.
1.) Go to the VM instances page and click on the Instance name of your VM.
2.) Click the Edit button at the top of the page.
3.) Under Custom metadata, click Add item.
4.) Set 'Key' to 'startup-script' and set 'Value' to this script:
#! /bin/bash
useradd -G sudo USERNAME
echo 'USERNAME:PASSWORD' | chpasswd
NOTE: change the value of USERNAME and PASSWORD to the name and password of your choice.
5.) Enable "Enable connecting to serial ports" by checking the box below the SSH button.
6.) Click Save and then click RESET on the top of the page. Wait for some time for the instance to reboot.
7.) Click on 'Connect to serial port' on the page. In the new window, you might need to wait a bit and press Enter once; then you should see the login prompt.
8.) Login using the USERNAME and PASSWORD you provided.
Note: Please do not share any of your password and username for your data security.
As the steps above didn't help me and the Google support representative looked at the log but didn't see anything wrong, she suggested debugging SSH by following this guide, which I will do when I have time: https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-ssh#use_your_disk_on_a_new_instance. Feels like I'm writing an essay. Will keep you posted.
The troubleshooting steps that you can follow are:
Use the serial console to view your instance logs and check whether the new instance you created from the snapshot failed to start to the appropriate run level where the ssh daemon would get started. If sshd was not started you would not have ssh access to your instance.
You can try restarting the instance if it doesn’t affect production and try to gain ssh access again. Might be that some issue prevented the instance from starting up properly and restarting it could fix it.
You can try creating another VM instance from the snapshot in case the previous instance wasn’t created properly.
If creating a new VM instance from the snapshot doesn’t fix the issue, it might be that the snapshot itself wasn’t created properly. You can read this documentation guide, section Understanding snapshot best practices, and try creating another snapshot and VM instances from it.
I had the same problem and after a lot of searching, I found an answer from user Peripheral from ServerFault that worked for me.
I found the fix for me. A recent update has a known issue where it removes the default gateway from the iptables. To fix it, I have to go to the instance and select Edit. Scroll down, and under Custom Metadata put the following:
key: startup-script
value: route add default gw <gatewayIP> eth0
Save and restart the VM.
Source
All credits to him/her, just want to share to help others find their solution faster.
I had the same issue. I eventually figured out that it was because I had attached a persistent disk and added an entry to the /etc/fstab file. This entry is supposed to mount the attached disk automatically when the instance restarts.
However, when I created a snapshot of the boot disk, I didn't remove the /etc/fstab entry. So a new instance created from this snapshot will always hit a boot error, as it tries to mount a disk that is not attached.
This information is present in the documentation.
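A quick way to spot such entries before taking a snapshot is sketched below. It lists every /etc/fstab entry other than the root filesystem that lacks the nofail mount option, since those are the ones that can block boot when the disk is missing. The function name is made up, and the check is deliberately simple: it does not know which entries really refer to detachable disks.

```shell
#!/bin/sh
# Print "device mountpoint" for each fstab entry that is not the root
# filesystem and does not carry the nofail option.
# Reads /etc/fstab unless another file is given as the first argument.
fstab_without_nofail() {
  awk '
    $1 ~ /^#/ || NF < 4 { next }               # skip comments and blank or short lines
    $2 == "/" { next }                          # the root filesystem has to mount anyway
    $4 !~ /(^|,)nofail(,|$)/ { print $1, $2 }   # options field lacks nofail
  ' "${1:-/etc/fstab}"
}
```

Entries it prints can either be removed before the snapshot or given the nofail option, so that a clone without the extra disk still boots.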

Google Compute Engine: Internal DNS server and issues with the resolving

Since Google Compute Engine does not provide internal DNS, I created two CentOS BIND machines that do the resolving for the machines on GCE and forward queries over VPN to my private cloud and vice versa.
As the Google Cloud help docs suggest, this kind of scenario is possible: you edit resolv.conf on each instance to do the resolving.
What I did was edit ifcfg-eth0 to disable PEERDNS, and in /etc/resolv.conf I added the search domain and, at the top, my 2 nameserver instances.
Now, after one instance gets rebooted, it won't start again because it's looking for the metadata.google.internal domain:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
What is the best practice in this kind of scenario?
Thanks.
Also, I need the internal DNS to do a poor man's round-robin failover, since GCE does not provide internal load balancers.
As mentioned at https://cloud.google.com/compute/docs/networking:
Each instance's metadata server acts as a DNS server. It stores the DNS entries for all network IP addresses in the local network and calls Google's public DNS server for entries outside the network. You cannot configure this DNS server, but you can set up your own DNS server if you like and configure your instances to use that server instead by editing the /etc/resolv.conf file.
So you should be able to just use 169.254.169.254 for your DNS server. If you need to define external DNS entries, you might like Cloud DNS. If you set up a domain with Cloud DNS, or any other DNS provider, the 169.254.169.254 resolver should find it.
If you need something more complex, such as custom internal DNS names, then your own BIND server might be the best solution. Just make sure that metadata.google.internal. resolves to 169.254.169.254.
OK, I just ran into this, but unfortunately there was no timeout after 30 minutes that got it working. Fortunately nelasx had correctly diagnosed it and given the fix. I'm adding this to give the steps I had to take based on his excellent question and commented answer. I've just pulled together in one place the info I had to gather to get to a solution.
Symptoms: on startup of the Google instance, connections are refused.
After inspecting the serial console output, you will see:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
You could try waiting, but that didn't work for me, and inspection of https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/google-startup-scripts/usr/share/google/onboot
# Failed to resolve host or connect to host. Retry indefinitely.
6|7) sleep 1.0
log "Waiting for metadata server, attempt ${count}"
Led me to believe that will not work.
So, the solution was to fiddle with the disk, to add in nelasx's solution:
"edit ifcfg-eth and change PEERDNS=no; edit /etc/resolv.conf and put on top your nameservers + search domain; edit /etc/hosts and add: 169.254.169.254 metadata.google.internal"
To do this,
Best to create a snapshot backup before you start in case it goes awry
Uncheck "Delete boot disk when instance is deleted" for your instance
Delete the instance
Create a micro instance
Mount the disk
sudo ls -l /dev/disk/by-id/* # lists the attached disks; the IDs include the instance name
sudo mkdir /mnt/new
sudo mount /dev/disk/by-id/scsi-0Google_PersistentDisk_instance-1-part1 /mnt/new
where instance-1 will be changed as per your setup
Go in and edit as per nelasx's solution. The idiot trap I fell for: edit the files under the mount point, not the debug instance's own files. Don't just run sudo vi /etc/hosts; use sudo vi /mnt/new/etc/hosts. That cost me 15 more minutes as I had to go through the "got depressed, scratched head, kicked myself" cycle.
Delete the debug instance, ensuring your attached disk delete option is unchecked
Create a new instance matching your original with the edited disk as your boot disk and fire it up.
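To make the /etc/hosts part of the edit harder to get wrong, here is a small sketch that takes the mount point as an argument and only ever touches the hosts file underneath it. The function name is made up, /mnt/new matches the mount point used in the steps above, and it has to be run as root to write to the mounted disk.

```shell
#!/bin/sh
# Append the metadata server entry to etc/hosts under a mounted root,
# unless an entry for metadata.google.internal is already there.
pin_metadata_host() {
  root="${1:?usage: pin_metadata_host <mounted-root>}"
  hosts="$root/etc/hosts"
  if [ ! -f "$hosts" ]; then
    echo "no $hosts found; is the disk mounted at $root?" >&2
    return 1
  fi
  if grep -q 'metadata\.google\.internal' "$hosts"; then
    echo "entry already present in $hosts"
  else
    echo '169.254.169.254 metadata.google.internal' >> "$hosts"
    echo "added entry to $hosts"
  fi
}
```

With the disk mounted as above, pin_metadata_host /mnt/new adds the line once and reports if it was already there. The ifcfg and resolv.conf edits still need doing by hand under the same mount point.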