Is there a way to restart pre-emptible Google Cloud VM instances (Compute Engine) on an hourly basis if it has stopped? - google-compute-engine

I have some science jobs to run where the instance is has a mounted network drive to a server that is always. Since the cost of always on VMs cost double, I wanted to run pre-emptible VMs but restart them hourly if they have stopped. I want the jobs to finish, but I also don't want to blow the budget on always on instances when I can have a little down time pay half the hourly rate.

Per this document, Compute Engine always terminates preemptible instances after they run for 24 hours. If you start or shut down the instance via the click “start” or “stop” button in the Google Cloud Console instance page, these actions will reset the 24-hour counter for preemptible instances.
If your science jobs always run longer than 24 hours, and you want to avoid jobs being interrupted, or if you want restart your preemptible instance hourly, you can use Cloud scheduler to start or stop your instance at a fixed time.

Related

How does Google Compute Engine decide what instances to shut down when autoscaling?

I'm creating a managed instance group with autoscaling in GCE. When a lot of work is queued up new instances will be created which start doing work.
Let's say each chunk of work takes 10 minutes, could it happen that GCE decides to shut down an instance that still has work in progress?
Autoscaler will immediately terminate instance if the health check condition meets.
However, you can use a shutdown script to control the termination. A shutdown script will run, on a best-effort basis, in the brief period between when the termination request is made and when the instance is actually terminated. During this period, Compute Engine will attempt to run your shutdown script to perform any tasks you provide in the script. You can read more about the autoscaler decision in this document. You can read about using shutdown script and its limitation at this link.
Also if these instances are offering backend service then it is good to enable connection draining. You can enable connection draining on backend services to ensure minimal interruption to your users when an instance is deleted automatically by an autoscaler or manually removed from an instance group. You can find more at this link about enabling connection draining.

Azure "MySQL server has gone away" for one minute only

I'm using Azure App Services to run about 15 PHP web apps. Most of these apps connect to my 'Azure Database for MySQL server' instance. This is a Basic-tier instance (1 vCore & 2GB memory).
The MySQL instance hosts about 30 small databases (ranging between 1 to 100MB in size).
The load on the MySQL instance is stable and low. CPU is constantly under 20%, memory is constantly under 50% and IO does not even show up in the metrics in the Azure Portal.
My problem is this:
Every once in a while the server goes offline for about 1 or 2 minutes (max 5 min). I see that client applications try to connect, they hang for a while to finally get the error:
SQLSTATE[HY000] [2006] MySQL server has gone away
It seems to happen randomly. Sometimes a few times a week or even a day. But sometimes it doesn't happen for weeks.
What's noticeable though, when it happens I see a downward spike in memory and an upward spike in CPU in the metrics graph on the portal like this:
Does anyone experience the same issue on Azure Database for MySQL? And did anyone find a solution?
I'm starting to think that it's caused by a resources movement on the Azure side but I don't have any evidence to back that up. If so, shouldn't that happen without any downtime?
Scaling up from the Basic 1 core tier with Compute Gen 4 to Basic 2 core tier with Compute Gen 5 seemed to resolve the problem.
Not sure though what was causing the issue though.
I started experiencing this error in May 2019.
If I happen to be connected on the mariadb server with ssh at the time it occurs and htop is running, I can see rsyslog suddenly going crazy. It bogs down the CPU and the network connection becomes unresponsive. The CPU and network activity doesn't show up in Azure but running w in the ssh session after the network recovers shows the CPU load was definitely very high during the last 15 minutes.
I traced it back to OMS agent. When that service is killed on the mariadb server, the server runs without any problem. As soon as OMS agent is started, "Mysql has gone away" pops up on the clients within 24 hours due to unresponsive network connection with the server machine.
It is possible to uninstall OMS agent from the Azure portal but it comes back within 48H.
The only way I found of getting rid of OMS agent is to stop walinuxagent too on the linux server.
Scaling the server up may solve the problem as you have more CPU power to process the extra CPU load induced by OMS agent. I prefer to kill OMS agent and walinuxagent instead of spending more money on an expansive server.
Edit:
It turns out OMS is installed because the VM is part of a Log Analytics workspace (search for Log Analytics workspaces in the search bar). Removing the VM from the workspace immediately uninstall OMS. There is no need to stop walinuxagent.

Why do my google cloud compute instances always unexpectedly restart?

Help! Help! Help!
It is really annoying and I almost cannot bear it anymore! I'm using google cloud compute engine instances but they often unexpectedly restart without any notification in advance. The restart of instances seems to happen randomly and I have no idea what's going wrong there! I'm pretty sure that the instances are been occupied (usage of CPUs > 50% and all GPUs are in use) when restart happens. Could anyone please tell me how to solve this problem? Thanks in advance!
The issue is right here:
all GPUs are in use
If you check the official documentation about GPU:
GPU instances must terminate for host maintenance events, but can automatically restart. These maintenance events typically occur once per week, but can occur more frequently when necessary. You must configure your workloads to handle these maintenance events cleanly. Specifically, long-running workloads like machine learning and high-performance computing (HPC) must handle the interruption of host maintenance events. Learn how to handle host maintenance events on instances with GPUs.
This is because an instance that has a GPU attached cannot be migrated to another host for maintenance as it happens for the rest of the virtual machines. To get a physical GPU attached to the instance and bare metal performance you are using GPU passthrough , which sadly means if the host has to go through maintenance the VM is going down with it.
This sounds like Preemptible VM instance.
Preemptible instances function like normal instances, but have the following limitations:
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
Compute Engine always terminates preemptible instances after they run for 24 hours.
To check if your instance is preemptible using gcloud cli, just run
gcloud compute instances describe instance-name --format="(scheduling.preemptible)"
Result
scheduling:
preemptible: false
change "instance-name" to real name.
Or simply via UI, click on compute instance and scroll down:
To check for system operations performed on your instance, you can review it using following command:
gcloud compute operations list

Can not start an instance using custom images on Google Cloud Platform

I'm running my Production on GCP in an instance group with autoscaling on. Then, I tried to scale up my servers, but an instance couldn't start from my everyday backup image.
I checked the logs and I saw that my instance was inserted and automatically deleted.
Logs.
Then I tried to create an instance (not in any instance groups), I checked the logs, and I found that the instance was started after nearly 10 minutes.
Logs
I wanted to know why it took almost 10 minutes to start an instance from a custom image?

Managed VMs running Perl on Google App Engine

I have a perl job that runs for 5 mins at the top of every hour. What is the most cost effective way of running this job on the Google Cloud infrastructure? Running a compute engine VM seems too heavy-weight for this since I'd get charged for the other 55 mins of no use. I don't understand the "Managed VMs" well enough, but it seems like this might be an option, but I'm not sure if pricing is rounded to the hour. Does anyone have any ideas what the best option is so that I only get charged for 120 mins of usage (24 times run * 5 minutes). The script also uses some image processing binaries, so converting to Python won't do the trick.
Managed VMs are linked to Google App Engine. If you have an App in GAE, managed VMs are used to configure the hosting environment for you App using VMs that run on Google Compute Engine and these applications are subject to Java and Python run time. This link can give you an idea on pricing on GAE, however Perl is not a supported language in GAE.
On GCE, you can start up an instance, do the task and then delete the instance without deleting the persistence disk, this will allow you to recreate the instance using this disk, however you will still be charged for the provisioned disk space and you will need to create a script that will spin up the instance and delete it. You can also create a snapshot of your disk and recreate your instance based on the snapshot, this will be little bit less expensive that keeping the disk.
Also, you should look at the type of persistence disks (PD) on GCE, at this link, take a look at the examples provided, since based on your operation, regular PD or SSD PD can make a big difference on price.
You can use the pricing calculator to estimate your charges
When you deploy to App Engine using a managed VM, an compute engine instance (managed by google) is created for you. All request to App Engine will be forwarded to the created compute engine instance.
To run your script in App Engine as a Managed VM, you will have to dockerize your project, as the managed VM runs a docker container.
I don't see a reason to use App Engine managed VM (just for running a script), as the cost will be same as using a compute engine instance.
Probably the most cost effective way is to create a script that:
Launches a compute engine instance
Install perl
Copies your script to the instance
Runs you script in the created instance
To schedule the execution, you can put at home/office a cron job that executes the above script.