Google Compute Engine: internal DNS server and issues with resolving - google-compute-engine

Since Google Compute Engine does not provide internal DNS, I created two CentOS BIND machines that do the resolving for the instances on GCE and forward queries over VPN to my private cloud, and vice versa.
As the Google Cloud help docs suggest, you can have this kind of scenario and edit resolv.conf on each instance to do the resolving.
What I did was edit ifcfg-eth0 to disable PEERDNS, and in /etc/resolv.conf I added the search domain and my two nameservers at the top.
Now, after an instance gets rebooted, it won't start again because it is searching for the metadata.google.internal domain:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
What is the best practice in this kind of scenario?
ty
Also, I need the internal DNS to do poor man's round-robin failover, since GCE does not provide internal load balancers.
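For reference, poor man's round-robin in BIND is just multiple A records under the same name; a minimal zone-file fragment as a sketch (the record name and addresses below are placeholders, not values from this setup):
; zone-file fragment (sketch) - "app" and the addresses are placeholders
app     60   IN   A   10.240.0.10
app     60   IN   A   10.240.0.11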

As mentioned at https://cloud.google.com/compute/docs/networking:
Each instance's metadata server acts as a DNS server. It stores the DNS entries for all network IP addresses in the local network and calls Google's public DNS server for entries outside the network. You cannot configure this DNS server, but you can set up your own DNS server if you like and configure your instances to use that server instead by editing the /etc/resolv.conf file.
So you should be able to just use 169.254.169.254 for your DNS server. If you need to define external DNS entries, you might like Cloud DNS. If you set up a domain with Cloud DNS, or any other DNS provider, the 169.254.169.254 resolver should find it.
If you need something more complex, such as custom internal DNS names, then your own BIND server might be the best solution. Just make sure that metadata.google.internal. resolves to 169.254.169.254.
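A minimal sketch of what that could look like on an instance pointing at your own BIND servers, assuming placeholder addresses and a placeholder search domain (neither is from the question):
# /etc/resolv.conf (sketch - placeholder addresses and domain)
search example.internal
nameserver 10.240.0.2    # first BIND server
nameserver 10.240.0.3    # second BIND server
# /etc/hosts - keep the metadata hostname resolvable so the startup scripts still work
169.254.169.254 metadata.google.internal metadata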

OK, I just ran into this, but unfortunately there was no timeout after 30 minutes that got it working. Fortunately nelasx had correctly diagnosed it and given the fix. I'm adding this to lay out the steps I had to take, based on his excellent question and commented answer; I've just pulled the info I had to gather together into one place to get to a solution.
Symptoms: on startup of the Google instance, you get connection refused.
After inspecting the serial console output, you will see:
Jul 8 10:17:14 instance-1 google: Waiting for metadata server, attempt 412
You could try waiting; that didn't work for me, and inspection of https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/google-startup-scripts/usr/share/google/onboot showed:
# Failed to resolve host or connect to host. Retry indefinitely.
6|7) sleep 1.0
log "Waiting for metadata server, attempt ${count}"
which led me to believe that waiting will not work.
So the solution was to fiddle with the disk to add in nelasx's fix:
"edit ifcfg-eth0 and change PEERDNS=no
edit /etc/resolv.conf and put your nameservers + search domain on top
edit /etc/hosts and add: 169.254.169.254 metadata.google.internal"
To do this,
Best to create a snapshot backup before you start in case it goes awry
Uncheck "Delete boot disk when instance is deleted" for your instance
Delete the instance
Create a micro instance
Mount the disk
sudo ls -l /dev/disk/by-id/* # this will list the attached disks by name
sudo mkdir /mnt/new
sudo mount /dev/disk/by-id/scsi-0Google_PersistentDisk_instance-1-part1 /mnt/new
where instance-1 should be changed to match your setup
Go in and edit as per nelasx's solution - idiot trap I fell for: edit the files under the mount point. Don't just sudo vi /etc/hosts; use /mnt/new/etc/hosts (see the command sketch after these steps). That cost me 15 more minutes as I went through the got-depressed, scratched-head, kicked-myself cycle.
Delete the debug instance, ensuring the attached disk's delete option is unchecked
Create a new instance matching your original, with the edited disk as the boot disk, and fire it up.
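For reference, a sketch of those edits, assuming the rescued disk is mounted at /mnt/new as above and follows a CentOS layout:
# Edit the files under the mount point, not the debug instance's own /etc.
$ sudo vi /mnt/new/etc/sysconfig/network-scripts/ifcfg-eth0   # set PEERDNS=no
$ sudo vi /mnt/new/etc/resolv.conf                            # your nameservers + search domain on top
$ echo '169.254.169.254 metadata.google.internal metadata' | sudo tee -a /mnt/new/etc/hosts
$ sudo umount /mnt/new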

Related

VerneMQ plugin_chain_exhausted Authentication MySQL

I have a running instance of VerneMQ (a cluster of 2 nodes) on Google Kubernetes, using MySQL (Cloud SQL) for auth. The server accepts connections over TLS.
It works fine, but after a few days I start seeing this message in the log:
can't authenticate client {[],<<"Client-id">>} from X.X.X.X:16609 due to plugin_chain_exhausted
The client app (Paho) complains that the server refused the connection for being "not authorized (code=5 in Paho error)".
After a few retries it finally connects, but every time it gets harder and harder until it just won't connect anymore.
If I restart VerneMQ, everything goes back to normal.
I have only 3 clients connected at most at the same time.
Clients that are already connected have no issues with pub/sub.
In my configuration i have (among other things):
log.console.level=debug
plugins.vmq_diversity=on
vmq_diversity.mysql.* = all of them set
allow_anonymous=off
vmq_diversity.auth_mysql.enabled=on
It's like the server degrades over time; the status web page reports no problem.
My Verne server was built from the git repository about a month ago and runs in a Docker container.
What could be the cause?
What else could I check to find possible causes? Maybe a vmq_diversity misconfiguration?
Tks
To quickly explain the plugin_chain_exhausted log: with Verne you can run multiple authentication/authorization plugins, and they will be checked in a chain. If one plugin allows the client, it will be let in. If no plugin allows the client, you'll see the log line above.
This does not explain the behaviour you describe, though. I don't think I have seen that.
In any case, the first thing to check is whether you actually run multiple plugins. For instance: have you disabled the vmq_passwd and vmq_acl plugins?
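As a sketch, leaving only the MySQL-backed plugin in the auth chain would look roughly like this in vernemq.conf (assuming the stock file-based plugins are the other links in your chain):
# vernemq.conf (sketch) - disable the file-based auth plugins so only
# vmq_diversity/MySQL answers auth requests
plugins.vmq_passwd = off
plugins.vmq_acl = off
plugins.vmq_diversity = on
vmq_diversity.auth_mysql.enabled = on
allow_anonymous = off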

Google Cloud instance can't be accessed via SSH after cloning

I'm desperate for help here. I have a compute engine instance that hosts a lot of websites. These are the steps that I took:
Go to Compute Engine > Snapshots and take a snapshot of my instance
Click on the newly created snapshot and click Create Instance.
The new instance has all the configs of the current running instance
Then when I tried to access the new instance via SSH, it wouldn't work. Error message:
"Connection Failed
We are unable to connect to the VM on port 22. Learn more about possible causes of this issue."
Clicking on Learn more gets me to https://cloud.google.com/compute/docs/ssh-in-browser#ssherror
The instance is booting up and sshd is not yet running - Not sure how to check this
The instance is not running sshd - Not sure how to check this either
sshd is listening on a port other than the one you are connecting to - My current instance has ssh running on port 22, so I guess this is fine?
There is no firewall rule allowing SSH access on the port - Again, my current instance has ssh running, so I don't think it's because of the firewall, right?
The firewall rule allowing SSH access is enabled, but is not configured to allow connections from GCP Console services. - Same as above
The instance is shut down - Instance is still running.
Strange thing is, if I create a fresh instance from scratch and then do the steps above to clone it, that new instance can be accessed normally via SSH.
Can anyone show me how to fix this if possible? Or show me how to see logs, check what went wrong, etc.? I tried to google but got pretty confused with all the jargon and where to find a particular thing. Sorry for the wall of text. Thanks
Edit #1: I got technical support from Google. The steps below might help someone else, but not me, as when I reached step 7 I waited forever and couldn't get to the login page.
1.) Go to the VM instances page and click on the Instance name of your VM.
2.) Click the Edit button at the top of the page.
3.) Under Custom metadata, click Add item.
4.) Set 'Key' to 'startup-script' and set 'Value' to this script:
#! /bin/bash
useradd -G sudo USERNAME
echo 'USERNAME:PASSWORD' | chpasswd
NOTE: change the value of USERNAME and PASSWORD to the name and password of your choice.
5.) Enable "Enable connecting to serial ports" by checking the box below the SSH button.
6.) Click Save and then click RESET on the top of the page. Wait for some time for the instance to reboot.
7.) Click on 'Connect to serial port' on the page. In the new window, you might need to wait a bit and press Enter on your keyboard once; then you should see the login prompt.
8.) Login using the USERNAME and PASSWORD you provided.
Note: Please do not share any of your password and username for your data security.
As those steps above couldn't help me, and the Google support representative looked at the logs but didn't see anything wrong, she suggested debugging SSH following this guide https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-ssh#use_your_disk_on_a_new_instance which I will do when I have time. Feel like I'm writing an essay. Will keep you posted.
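For what it's worth, the same console steps can be done with the gcloud CLI; a sketch, with the instance name, zone, and script filename as placeholders:
# Attach the startup script as instance metadata, enable serial port access,
# then reset the instance and connect to its serial console.
$ gcloud compute instances add-metadata INSTANCE_NAME --zone=ZONE \
    --metadata-from-file startup-script=startup.sh
$ gcloud compute instances add-metadata INSTANCE_NAME --zone=ZONE \
    --metadata serial-port-enable=true
$ gcloud compute instances reset INSTANCE_NAME --zone=ZONE
$ gcloud compute connect-to-serial-port INSTANCE_NAME --zone=ZONE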
The troubleshooting steps that you can follow are:
Use the serial console to view your instance logs (see the gcloud sketch after these steps) and check whether the new instance you created from the snapshot failed to reach the run level where the ssh daemon gets started. If sshd was not started, you would not have ssh access to your instance.
You can try restarting the instance, if that doesn't affect production, and try to gain ssh access again. It might be that some issue prevented the instance from starting up properly, and restarting it could fix it.
You can try creating another VM instance from the snapshot in case the previous instance wasn’t created properly.
If creating a new VM instance from the snapshot doesn’t fix the issue, it might be that the snapshot itself wasn’t created properly. You can read this documentation guide, section Understanding snapshot best practices, and try creating another snapshot and VM instances from it.
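For the first step, the serial console output can be pulled without connecting interactively; a sketch (instance name and zone are placeholders):
# Dump the boot log and look for sshd startup, fsck, or mount errors.
$ gcloud compute instances get-serial-port-output INSTANCE_NAME --zone=ZONE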
I had the same problem and, after a lot of searching, I found an answer from user Peripheral on ServerFault that worked for me.
I found the fix for me. A recent update has a known issue where it removes the default gateway from the iptables. To fix it, I have to go to the instance and select Edit. Scroll down, and under Custom Metadata put the following:
key: startup-script
value: route add default gw <gatewayIP> eth0
Save and restart the VM.
Source
All credit to them - just want to share to help others find their solution faster.
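If you want to confirm this is the cause before adding the startup script, the routing table can be checked from the serial console; a sketch (the gateway IP is a placeholder):
# On the affected instance, via a serial console login:
$ ip route show                               # a healthy instance lists a 'default via ...' route
$ sudo route add default gw GATEWAY_IP eth0   # the same fix, applied by hand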
I had the same issue. I eventually figured out that it was because, when I attached a persistent disk, I added an entry to the /etc/fstab file. This entry is supposed to automatically mount the attached disk when the instance restarts.
However, when I created a snapshot of the boot disk, I didn't remove the /etc/fstab entry. So creating a new instance from this snapshot will always cause a boot error, as the system tries to mount a disk that is not attached.
This information is present in the documentation
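A sketch of a safer fstab entry, assuming an ext4 data disk (the UUID and mount point are placeholders); the nofail option lets the instance boot even when the referenced disk isn't attached:
# /etc/fstab (sketch) - 'nofail' keeps boot from hanging when the disk is absent
UUID=DISK_UUID  /mnt/data  ext4  discard,defaults,nofail  0  2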

Google Compute VM hacked, now what?

I've been running my Google Compute VM for literally one day, and I was hacked from this IP: http://www.infobyip.com/ip-121.8.187.25.html
I'm trying to understand what I can do next (the user connected via ssh and the root password was changed) to avoid these types of attacks, and to understand more than what /var/log/auth.log is telling me.
I assume you already deleted the instance from the Developers Console, right?
As suggested, always use ssh RSA keys to connect to your instance instead of passwords. Additionally, depending on where you want access from, you can allow only certain IPs through the firewall. Configuring the firewall along with iptables gives you better security.
You may also want to take a look at sshguard. sshguard adds iptables rules automatically when it detects a number of failed connection attempts.
Just to make sure, please change the default port 22 in /etc/ssh/sshd_config to something else.
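A sketch of the corresponding sshd_config changes, plus a source-restricted GCE firewall rule (the port number, network name, and admin IP are placeholders):
# /etc/ssh/sshd_config (sketch)
Port 2222                     # anything other than the default 22
PasswordAuthentication no     # keys only
PermitRootLogin no
# Reload sshd after editing (the service name varies by distro).
$ sudo systemctl restart sshd
# Only allow SSH from a known admin address (placeholder values).
$ gcloud compute firewall-rules create allow-ssh-admin \
    --network=default --allow=tcp:2222 --source-ranges=ADMIN_IP/32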

First connect from Prestashop to Google Cloud SQL always fails

I'm setting up a PrestaShop installation on a development server, which is a GCE instance, using Cloud SQL as the database server. Everything works just fine except one thing: whenever there is a long period of inactivity on the site, the first page load after that always gives me this error:
Link to database cannot be established: SQLSTATE[HY000] [2003]
If I refresh the page, the error is gone and never appears again until I stop using the site for an hour or so. It almost looks like the database instance is going into sleep mode or something like that.
The reason I mention PrestaShop is that I never get this error when using Adminer or connecting to the database from the mysql console client.
With the per-use billing model, instances are spun down after a 15-minute timeout to save you money. They then take a few seconds to spin up when next accessed. It may be that PrestaShop is timing out on these first requests (though I have no experience with that application).
Try changing your instance to package billing, which has a 12-hour timeout, to see if this helps:
https://developers.google.com/cloud-sql/faq#how_usage_calculated
According to GCE documentation,
Once a connection has been established with an instance, traffic is permitted in both directions over that connection, until the connection times out after 10 minutes of inactivity
I suspect that might be the cause. To get around it, you can try lowering the TCP keepalive time.
Refer here: https://cloud.google.com/sql/docs/compute-engine-access
To keep long-lived unused connections alive, you can set the TCP keepalive. The following commands set the TCP keepalive value to one minute and make the configuration permanent across instance reboots.
# Display the current tcp_keepalive_time value.
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
# Set tcp_keepalive_time to 60 seconds and make it permanent across reboots.
$ echo 'net.ipv4.tcp_keepalive_time = 60' | sudo tee -a /etc/sysctl.conf
# Apply the change.
$ sudo /sbin/sysctl --load=/etc/sysctl.conf
# Display the tcp_keepalive_time value to verify the change was applied.
$ cat /proc/sys/net/ipv4/tcp_keepalive_time

Trouble setting up witness in SQL Server mirroring scheme w/ error

I've got a trio of Windows servers (data1, data2 and datawitness) that aren't part of any domain and don't use AD. I'm trying to set up mirroring based on the instructions at http://alan328.com/SQL2005_Database_Mirroring_Tutorial.aspx. I've had success right up until the final set of instructions where I tell data1 to use datawitness as the witness server. That step fails with the following message:
alter database MyDatabase set witness = 'TCP://datawitness.somedomain.com:7024'
The ALTER DATABASE command could not be sent to the remote server instance 'TCP://datawitness.somedomain.com:7024'. The database mirroring configuration was not changed. Verify that the server is connected, and try again.
I've tested both port 7024 and port 1433 using telnet, and both servers can indeed connect to each other. I'm also able to add a connection to the witness server from SQL Server Manager on the primary server. I've used the Configuration Manager on both servers to enable Named Pipes and verify that IP traffic is enabled and using port 1433 by default.
What else could it be? Do I need any additional ports open for this to work? (The firewall rules are very restrictive, but I know traffic on the previously mentioned ports is explicitly allowed)
Caveats that are worth mentioning here:
Each server is in a different network segment
The servers don't use AD and aren't part of a domain
There is no DNS server configured for these servers, so I'm using the HOSTS file to map domain names to IP addresses (verified using telnet, ping, etc).
The firewall rules are very restrictive and I don't have direct access to tweak them, though I can call in a change if needed
Data1 and Data2 are using SQL Server 2008, Datawitness is using SQL Express 2005. All of them use the default instance (i.e. none of them are named instances)
After combing through blogs, KB articles, and forum posts, and reinstalling, reconfiguring, rebooting, profiling, etc., I finally found the key to the puzzle - an entry in the event log on the witness server reported this error:
Database mirroring connection error 2 'DNS lookup failed with error: '11001(No such host is known.)'.' for 'TCP://ABC-WEB01:7024'.
I had used a hosts file to map mock domain names for all three servers in the form datax.mydomain.com. However, it is now apparent that the witness was trying to communicate back using the name of the primary server, which I did not have a hosts entry for. Simply adding another entry for ABC-WEB01 pointing to the primary web server did the trick. No errors, and the mirroring is finally complete.
Hope this saves someone else a billion hours.
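For reference, a sketch of the resulting HOSTS file on the witness (all addresses are placeholders; the last line is the kind of entry that was missing):
# C:\Windows\System32\drivers\etc\hosts on the witness (sketch)
10.0.0.1   data1.mydomain.com
10.0.0.2   data2.mydomain.com
10.0.0.3   datawitness.mydomain.com
10.0.0.1   ABC-WEB01           # the primary's machine name, which the witness tries to resolve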
I'd like to add one more sub-answer to this specific question. As my comment on Chris' answer shows, my mirror was showing up as disconnected (to the witness). Apparently you need to reboot the witness server (or, in my case, just restart the service).
As soon as I did this, the mirror showed the witness connection as Connected!
See: http://www.bigresource.com/Tracker/Track-ms_sql-cBsxsUSH/