GCE VM taken down randomly and re-inserted

I have a single GCE VM running behind an HTTP load balancer. It's too early to run multiple VMs; I'm keeping costs down.
What happens is that this VM seems to be deleted and re-inserted at random by GCE, and my tiny web app goes down for the duration. I see these entries in my VM logs:
compute.instances.delete
compute.instances.insert
This looks like GCE maintenance activity. Is there a way to avoid it?
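For what it's worth, GCE's scheduled host maintenance normally live-migrates a VM rather than deleting it, so a compute.instances.delete/insert pair in the audit log usually means some actor (a user, a managed instance group's autohealer, or an updater) recreated the instance. A minimal diagnostic sketch with the gcloud CLI (my-vm and the zone are placeholder names):

# Who or what issued the delete/insert operations against this VM?
gcloud compute operations list \
    --filter="targetLink~my-vm AND (operationType=delete OR operationType=insert)"

# Check the VM's maintenance policy; MIGRATE means no downtime
# during host maintenance events.
gcloud compute instances describe my-vm --zone=us-central1-a \
    --format="value(scheduling.onHostMaintenance,scheduling.automaticRestart)"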

Related

Creating a Staging VM in Google Compute Engine

I'm trying to set up a staging VM for a production site I've just inherited. The site runs WordPress/WooCommerce and has not been updated in a while, and the VM hosting it runs an old version of PHP. Obviously this all needs to be fixed up, but I'm unfamiliar with GCP Compute Engine. Also, any attempt to run backup/clone plugins crashes the site and requires a restore from the daily snapshot, which is very annoying.
Is it possible to clone the VM/disk to a new instance, point that at a temporary domain, and test/update the site there? I have been trying to do this for a while now without much luck; any suggestions would be much appreciated. Thanks.
Creating a clone of an existing VM is possible and quite easy.
Create a snapshot of the VM's disk. If possible, stop the VM before doing this: that way you get an exact, consistent snapshot of the drive without any errors. You can snapshot while the VM is running too, if stopping it is out of the question.
Create a VM from the snapshot: select the snapshot you've just created as the boot disk. Remember to assign a static public IP to this VM; otherwise the IP changes after a VM restart, and since you're going to do some configuration, a restart is likely to happen. You can change the VM's specs at this point too: nothing stops you from adding or removing CPUs, RAM, etc. It may well be that your VM is underutilised and something smaller will save costs. Or the opposite.
Start the machine. Now you can modify your WP configuration to point to the new domain. For SSL, you can either use an external certificate or one provided by GCP (the most convenient solution).
If you already own a domain you want to use for staging, you can host it in Cloud DNS or at some other provider; just point it at the external IP you reserved.
If you will be hosting your domain in Cloud DNS, you will find the necessary information in the documentation on managed zones (domains).
You could also consider creating a new VM, using it as a template for a managed, autoscaled instance group, and putting an external HTTPS load balancer in front of it. But that adds some complexity, so it's only an idea for when you need to handle a lot more traffic.
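A minimal gcloud sketch of the snapshot-and-clone steps above (the instance, snapshot, and address names plus the zone/region are placeholders; it assumes the boot disk has the same name as the VM, which is the default):

# Stop the source VM for a fully consistent snapshot (optional).
gcloud compute instances stop prod-vm --zone=europe-west1-b

# Snapshot the boot disk.
gcloud compute disks snapshot prod-vm --zone=europe-west1-b \
    --snapshot-names=prod-clone-snap

# Reserve a static external IP for the staging VM.
gcloud compute addresses create staging-ip --region=europe-west1

# Create the staging VM, building its boot disk from the snapshot;
# the machine type can differ from the original.
gcloud compute instances create staging-vm --zone=europe-west1-b \
    --machine-type=e2-small \
    --source-snapshot=prod-clone-snap \
    --address=staging-ip

# Bring production back up.
gcloud compute instances start prod-vm --zone=europe-west1-b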

Azure "MySQL server has gone away" for one minute only

I'm using Azure App Services to run about 15 PHP web apps. Most of these apps connect to my Azure Database for MySQL instance, which is Basic tier (1 vCore, 2 GB memory).
The MySQL instance hosts about 30 small databases (ranging between 1 to 100MB in size).
The load on the MySQL instance is stable and low: CPU is constantly under 20%, memory is constantly under 50%, and IO does not even show up in the metrics in the Azure Portal.
My problem is this:
Every once in a while the server goes offline for about 1 or 2 minutes (5 at most). I can see client applications trying to connect; they hang for a while and finally get the error:
SQLSTATE[HY000] [2006] MySQL server has gone away
It seems to happen randomly. Sometimes a few times a week or even a day. But sometimes it doesn't happen for weeks.
What's noticeable, though, is that when it happens I see a downward spike in memory and an upward spike in CPU in the metrics graph on the portal.
Does anyone experience the same issue on Azure Database for MySQL? And did anyone find a solution?
I'm starting to think it's caused by resources being moved around on the Azure side, but I don't have any evidence to back that up. And if so, shouldn't that happen without any downtime?
Scaling up from the Basic 1-core tier on Compute Gen 4 to the Basic 2-core tier on Compute Gen 5 seemed to resolve the problem.
I'm still not sure what was causing the issue, though.
I started experiencing this error in May 2019.
If I happen to be connected to the MariaDB server over SSH when it occurs, with htop running, I can see rsyslog suddenly going crazy. It bogs down the CPU, and the network connection becomes unresponsive. The CPU and network activity doesn't show up in Azure, but running w in the SSH session after the network recovers shows that the load was definitely very high during the last 15 minutes.
I traced it back to the OMS agent. When that service is killed on the MariaDB server, the server runs without any problem. As soon as the OMS agent is started, "MySQL server has gone away" pops up on the clients within 24 hours, due to the unresponsive network connection to the server machine.
It is possible to uninstall the OMS agent from the Azure portal, but it comes back within 48 hours.
The only way I found of getting rid of the OMS agent was to stop walinuxagent on the Linux server too.
Scaling the server up may solve the problem, since the extra CPU power can absorb the load induced by the OMS agent, but I prefer to kill the OMS agent and walinuxagent rather than spend more money on an expensive server.
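For reference, this is roughly how I stop the two agents (a sketch for my Ubuntu server; paths and service names may differ on other distros):

# Stop the OMS agent via its bundled control script.
sudo /opt/microsoft/omsagent/bin/service_control stop

# Stop and disable the Azure guest agent so the OMS extension
# cannot be pushed back (the service is named waagent on RHEL/CentOS).
sudo systemctl stop walinuxagent
sudo systemctl disable walinuxagent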
Edit:
It turns out OMS is installed because the VM is part of a Log Analytics workspace (search for "Log Analytics workspaces" in the portal's search bar). Removing the VM from the workspace immediately uninstalls OMS. There is no need to stop walinuxagent.

GCP HTTPS load balancing: when is it safe to delete instance?

TL;DR: What is the upper bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can safely be deleted?
Details: I have a relatively standard setup: GCE instances in a managed instance group, and a global HTTPS load balancer in front of them pointed at a backend service containing only that one managed instance group. Health checks are standard: 5-second timeout, 5-second unhealthy threshold, 2 consecutive failures, 2 consecutive successes.
I deploy some new instances, add them to the instance group, and remove the old ones. After many minutes (10-15 min usually), I delete the old instances.
Every once in a while, I notice that deleting the old instances (which I believe are no longer receiving traffic) correlates with a sporadic 502 response to a client, visible only in the load-balancer-level logs.
I've done a bunch of log correlation, tcpdumping, and load testing to be fairly confident that this 502 is not being served by one of the new, healthy instances. In any case, my question is:
What is the upper bound on how long I should wait to guarantee that a GCE instance has been removed from the load-balancing path and can safely be deleted?
I think what you are looking for is the connection draining feature: https://cloud.google.com/compute/docs/load-balancing/enabling-connection-draining
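For example, a minimal sketch (the backend service name is a placeholder); with a 300-second drain, an instance removed from the backend keeps serving in-flight requests for up to 5 minutes before it can be safely deleted:

# Allow in-flight requests up to 300s to finish when an instance
# is removed from the backend service.
gcloud compute backend-services update my-backend-service \
    --connection-draining-timeout=300 --global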
To answer my own question: it turns out these 502s were not related to shutting down an instance; 10 minutes was plenty of time to remove an instance from the serving path. The 502s were caused by a race condition between nginx's timeouts and the GCP HTTP(S) Load Balancer's timeouts. I've written up a full blog post on it here: Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer
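The gist of the fix, for anyone who doesn't want to click through (a sketch assuming a stock Debian/Ubuntu nginx layout): the load balancer holds idle keepalive connections to backends for roughly 600 seconds, so nginx must keep them open longer than that; otherwise nginx can close a connection at the same moment the LB reuses it, and the client gets a 502:

# Keep backend keepalive connections open longer than the LB's
# ~600s idle timeout so nginx never closes a connection the LB
# still considers reusable.
sudo tee /etc/nginx/conf.d/lb-keepalive.conf <<'EOF'
keepalive_timeout 650;
keepalive_requests 10000;
EOF
sudo nginx -t && sudo systemctl reload nginx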

Google Compute Instance 100% CPU Utilisation

I am running an n1-standard-1 (1 vCPU, 3.75 GB memory) Compute Engine instance. Around 80 users of my Android app are online right now, the instance's CPU utilisation is at 99%, and the app has become less responsive. Kindly suggest a workaround. If I need to upgrade, can I do that with the same instance, or does a new instance need to be created?
Since your app is running already and users are connecting to it, you don't want to do the following process:
shut down the VM instance, keeping the boot disk and other disks
boot a more powerful instance, using the boot disk from step (1)
attach and mount any additional disks, if applicable
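(For completeness: if a short window of downtime is acceptable, you can resize the instance in place instead of swapping boot disks; a minimal sketch, with a placeholder name and zone:)

# Scale up in place: stop, change machine type, start again.
gcloud compute instances stop my-vm --zone=asia-south1-a
gcloud compute instances set-machine-type my-vm \
    --zone=asia-south1-a --machine-type=n1-standard-2
gcloud compute instances start my-vm --zone=asia-south1-a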
Instead, you might want to do the following:
create an additional VM instance with similar software/configuration
create a load balancer and add both the original and new VM to it as a backend
change your DNS name to point to the load balancer IP instead of the original VM instance
Now your users will be sent to whichever VM is least loaded, and you can add more VMs if your traffic increases.
You did not describe your application in detail, so it's unclear whether each VM holds local state (e.g., runs its own database) or there's an external database. You will still need to figure out how to manage stateful systems such as the database or user-uploaded data across all the VM instances, which is hard to advise on given the little information in your question.
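A minimal gcloud sketch of the load-balanced setup described above (all names and the zone are placeholders; it assumes both VMs serve plain HTTP on port 80):

# Group the two web VMs.
gcloud compute instance-groups unmanaged create web-group --zone=us-central1-a
gcloud compute instance-groups unmanaged add-instances web-group \
    --zone=us-central1-a --instances=original-vm,new-vm

# Health check and backend service pointing at the group.
gcloud compute health-checks create http web-hc --port=80
gcloud compute backend-services create web-backend \
    --protocol=HTTP --health-checks=web-hc --global
gcloud compute backend-services add-backend web-backend \
    --instance-group=web-group --instance-group-zone=us-central1-a --global

# Frontend: URL map, HTTP proxy, global forwarding rule.
gcloud compute url-maps create web-map --default-service=web-backend
gcloud compute target-http-proxies create web-proxy --url-map=web-map
gcloud compute forwarding-rules create web-fwd \
    --target-http-proxy=web-proxy --ports=80 --global

Point your DNS A record at the forwarding rule's IP and you can then add or remove backend VMs without touching DNS again.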

Compute Engine VM instance group got wiped out?

I'm new to GCE and want to migrate my web site there. I created a VM instance group, installed all the packages, and set it up a couple of days ago. But today I noticed my VM instance group has a different name (a different suffix, to be exact), and the disk has been wiped empty. Is it possible to restore its state, or at least make sure it won't get wiped out again? I'm surprised that GCE wiped out everything, and I wonder if I'm missing something in the setup.
A few details in case they are related:
I'm using a trusty image for the VM.
Storage is a regular persistent disk.
It was working with an ephemeral IP, and yesterday I started using Cloud DNS to host my domain. I should have used a static IP, but that mistake shouldn't cause the VM instance group to be flushed...
I'm using Cloud SQL as the database service.
Maybe I should just use a plain VM instance, given that I don't have much traffic now?
Any help will be greatly appreciated~
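One thing worth checking (a sketch; the group name and zone are placeholders): if this is a managed instance group, the group manager recreates instances from the instance template during autohealing, resizes, or rolling updates. A recreated instance gets a fresh boot disk built from the template, and newly created instances get generated name suffixes, which would explain both symptoms:

# Is the group managed, and which template does it build instances from?
gcloud compute instance-groups managed describe my-group --zone=us-central1-a

# Current instances and their status.
gcloud compute instance-groups managed list-instances my-group \
    --zone=us-central1-a

If that's the case, a plain standalone VM instance (as you suggest) will behave the way you expected: its disk persists until you delete it.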