Why do my Google Cloud Compute Engine instances always unexpectedly restart?

Help! Help! Help!
It is really annoying and I almost cannot bear it anymore! I'm using Google Cloud Compute Engine instances, but they often restart unexpectedly without any advance notification. The restarts seem to happen randomly and I have no idea what's going wrong! I'm pretty sure the instances are busy (CPU usage > 50% and all GPUs in use) when a restart happens. Could anyone please tell me how to solve this problem? Thanks in advance!

The issue is right here:
all GPUs are in use
If you check the official documentation about GPU:
GPU instances must terminate for host maintenance events, but can automatically restart. These maintenance events typically occur once per week, but can occur more frequently when necessary. You must configure your workloads to handle these maintenance events cleanly. Specifically, long-running workloads like machine learning and high-performance computing (HPC) must handle the interruption of host maintenance events. Learn how to handle host maintenance events on instances with GPUs.
This is because an instance with a GPU attached cannot be live-migrated to another host for maintenance, as happens with other virtual machines. To get a physical GPU attached to the instance with bare-metal performance, you are using GPU passthrough, which sadly means that if the host has to go through maintenance, the VM goes down with it.
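One way to handle these events cleanly is to poll the metadata server's maintenance-event key from inside the VM and checkpoint when an event is pending. The sketch below assumes it runs on the instance itself; save_checkpoint is a hypothetical hook, not part of any Google tooling, and the 60-second poll interval is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: react to a pending host maintenance event on a GPU instance.
# Assumption: save_checkpoint is a hypothetical hook; replace it with
# whatever persists your workload's state.

METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Ask the metadata server for the current maintenance state.
# This only works from inside a Compute Engine VM.
get_maintenance_event() {
  curl -s -H "Metadata-Flavor: Google" "$METADATA_URL"
}

# Map an event value to an action. NONE (or an empty reply, e.g. when the
# metadata server is unreachable) means nothing is scheduled; any other
# value, such as TERMINATE_ON_HOST_MAINTENANCE, means an event is pending.
maintenance_action() {
  case "$1" in
    NONE|"") echo "continue" ;;
    *)       echo "checkpoint" ;;
  esac
}

# Hypothetical placeholder for your own checkpoint logic.
save_checkpoint() {
  echo "saving checkpoint at $(date -u +%FT%TZ)"
}

# Poll once a minute; checkpoint and stop polling when an event is pending.
watch_maintenance() {
  while true; do
    if [ "$(maintenance_action "$(get_maintenance_event)")" = "checkpoint" ]; then
      save_checkpoint
      return 0
    fi
    sleep 60
  done
}
```

Running watch_maintenance in the background next to the training job would be one way to use it. The workload still has to resume from the checkpoint after the automatic restart.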

This sounds like a preemptible VM instance.
Preemptible instances function like normal instances, but have the following limitations:
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
Compute Engine always terminates preemptible instances after they run for 24 hours.
To check if your instance is preemptible using gcloud cli, just run
gcloud compute instances describe instance-name --format="(scheduling.preemptible)"
Result
scheduling:
preemptible: false
Replace instance-name with your instance's real name.
Or, in the Console UI, click the instance and scroll down to Availability policy > Preemptibility.
To check for system operations performed on your instance, review them with the following command:
gcloud compute operations list
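The full operations list can be long, so it may help to filter it. The sketch below wraps two filtered variants in functions: one that shows only preemption events, and one that shows every operation targeting a single instance. The function names are made up for illustration, and the targetLink regular expression is an assumption about how your instance URLs end.

```shell
#!/usr/bin/env bash
# Sketch: narrow down "gcloud compute operations list".

# Show only preemption events in the current project.
list_preemptions() {
  gcloud compute operations list \
    --filter="operationType=compute.instances.preempted"
}

# Show operations that targeted one instance, newest first.
# $1 = instance name (assumed to be the last path segment of targetLink).
list_instance_operations() {
  gcloud compute operations list \
    --filter="targetLink~'/instances/$1\$'" \
    --sort-by="~insertTime" \
    --format="table(insertTime, operationType, status, user)"
}
```

For example, list_instance_operations instance-name prints a table in which system-initiated restarts should show up next to the operations you triggered yourself.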

Related

NVIDIA Tesla K80 VM instance won't start on Google Cloud Compute

I created a VM instance on Google Cloud Compute attaching a NVIDIA Tesla K80 and using SSD for persistent storage.
I'm running Ubuntu 16.04 LTS on it and stopped the instance to prevent billing during no-usage time. Now I try to start the instance again but it won't start, neither from Console nor from Terminal (macOS).
I have already tried to view the instance's console port log, but it's not available as the instance is not running.
I would suggest checking the preemptibility of your instance and GPU. You can check whether your GPU is preemptible from the quota page. You can check the preemptibility of your instance by clicking on the instance and seeing whether Availability policy > Preemptibility is off or on, or by following this document.
Keep in mind that preemptible GPUs only work on a preemptible instance.
If both preemptibility settings match, it might be a project-specific issue that requires one-to-one investigation. To get this support, you can open an issue in the public issue tracker so someone from Google can assist you.

Having a hard time starting a GCE VM with GPU in us-central1

We are having a problem starting a simple GCE VM with a GPU in us-central1. I am wondering if anyone has experienced the same thing. The error we got is below:
Instance 'instance-group-2-vc37' creation failed: The resource 'projects/xxxxx-xxxx-858/zones/us-central1-a/acceleratorTypes/nvidia-tesla-k80' was not found (when acting as 'xxxxxxx@cloudservices.gserviceaccount.com')
Thanks
GCE doesn't offer GPUs in us-central1. The docs list which regions GPUs are available in.
Cloud ML Engine is a separate product and not what you are using here.
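Besides the docs, you can ask gcloud directly which zones offer a given accelerator type. This is a minimal sketch; the function name zones_with_gpu is made up for illustration, and the output depends on what is available to your project when you run it.

```shell
#!/usr/bin/env bash
# Sketch: list the zones where an accelerator type is offered.
# $1 = accelerator type name, e.g. nvidia-tesla-k80 (taken from the error above).
zones_with_gpu() {
  gcloud compute accelerator-types list \
    --filter="name=$1" \
    --format="value(zone.basename())"
}
```

If zones_with_gpu nvidia-tesla-k80 does not print the zone your instance group uses, creation will fail with a "was not found" error like the one above.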

Shutting down VM on Google Compute Engine always restarts

I set up a 1-node cluster on Google Container Engine which I only intend to use for testing, so I want to be able to keep it shut down while I am not using it to keep my costs low. However, I cannot figure out why the VM continually restarts after I shut it down through the console. I have set the "Automatic Restarts" option to false on the VM.
The VM is a n1-standard-2 (2 vCPUs, 7.5 GB memory) with 2 standard persistent disks attached.
Has anyone else faced this issue, or have experience with how to set up GCE so that you can keep it offline while not in use? Thanks in advance for any help.
The VMs in GKE clusters are managed by what's called a Managed Instance Group, which ensures that there's always the expected number of nodes in your cluster. I'd guess that it's seeing that there isn't a VM running in your project and assuming that something's gone wrong, so it recreates it.
You could stop it from doing so by explicitly resizing the instance group down to 0. You can change the number of nodes in the cluster either via the Container Engine UI or by running gcloud container clusters resize $CLUSTERNAME --size=0.
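Newer gcloud releases document --num-nodes as the replacement for the deprecated --size flag, so a wrapper for scaling the cluster down and back up might look like the sketch below. The function name is hypothetical, and depending on your setup you may also need to pass --zone or --node-pool.

```shell
#!/usr/bin/env bash
# Sketch: resize a cluster to a given node count without the confirmation prompt.
# $1 = cluster name, $2 = target number of nodes
resize_cluster() {
  gcloud container clusters resize "$1" --num-nodes="$2" --quiet
}
```

For example, resize_cluster "$CLUSTERNAME" 0 when you finish testing, and resize_cluster "$CLUSTERNAME" 1 when you come back.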

EC2 Instance is running very slow

I am running an EC2 instance on an Ubuntu Server machine. Tomcat and MySQL are installed, and a Java web application has been deployed on it for 1 month. It ran well with great performance for almost 1 month, but now my application responds very slowly.
Also, a point to note: earlier, when I logged into my Ubuntu Server through PuTTY, it was quick, but now it's slow even when I enter the Ubuntu password.
Is there any solution?
I would start by checking memory, CPU, and network availability to see whether any of them is the bottleneck.
Try the following commands:
To check memory availability:
free -m
To check CPU usage:
top
To check network usage:
ntop
To check disk usage:
df -h
To check disk io operations:
iotop
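To compare the server's state over time, the non-interactive commands above can be bundled into a one-shot snapshot. This is only a sketch: resource_snapshot is a made-up name, top -b assumes the Linux procps version of top, and ntop and iotop are left out because they are interactive and usually need root.

```shell
#!/usr/bin/env bash
# Sketch: print a one-shot snapshot of memory, CPU, and disk usage.
# Tools that are not installed are skipped instead of failing.
resource_snapshot() {
  echo "== memory (MB) =="
  command -v free >/dev/null 2>&1 && free -m
  echo "== top CPU consumers =="
  # -b = batch mode, -n 1 = a single iteration (Linux procps top)
  command -v top >/dev/null 2>&1 && top -b -n 1 2>/dev/null | head -n 15
  echo "== disk usage =="
  df -h
}
```

Redirecting the output to a timestamped file every few minutes gives you something to look at after the next slowdown.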
Please also check whether you can log in to the machine quickly once your application is disabled. If login is still slow, you should contact EC2 support about the poor performance and ask for more resources to be assigned to that machine.
You can use the WAIT Tool to diagnose what is wrong with your server or your application. The tool gathers information about CPU and memory utilization, running threads, etc.
In addition, I would definitely check the Tomcat application server with VisualVM or some other profiler. For configuring JMX for Tomcat, you can check the article here.
For network monitoring, the nload tool is worth your attention. You can launch it in screen so you can always check network utilization stats when the server is slow.
First, check whether any application is using too much CPU or memory with the top command. Two shortcut keys help here: inside top, press M to sort processes by memory usage, from highest to lowest, and press P to sort them by CPU usage, from highest to lowest.
If you cannot find any suspicious application using top, you can use iotop, which shows disk I/O usage details.
I was facing the same issue. The solution that worked for me was:
Restart the EC2 instance
Edit
Later I figured out that this issue happens when too few resources (memory, CPU) are available to the EC2 machine. So check the resources available to the EC2 machine.

Managed VMs running Perl on Google App Engine

I have a perl job that runs for 5 mins at the top of every hour. What is the most cost effective way of running this job on the Google Cloud infrastructure? Running a compute engine VM seems too heavy-weight for this since I'd get charged for the other 55 mins of no use. I don't understand the "Managed VMs" well enough, but it seems like this might be an option, but I'm not sure if pricing is rounded to the hour. Does anyone have any ideas what the best option is so that I only get charged for 120 mins of usage (24 times run * 5 minutes). The script also uses some image processing binaries, so converting to Python won't do the trick.
Managed VMs are linked to Google App Engine. If you have an app in GAE, Managed VMs are used to configure the hosting environment for your app using VMs that run on Google Compute Engine, and these applications are limited to the Java and Python runtimes. This link can give you an idea of pricing on GAE; however, Perl is not a supported language in GAE.
On GCE, you can start an instance, do the task, and then delete the instance without deleting the persistent disk. This lets you recreate the instance from that disk, but you will still be charged for the provisioned disk space, and you will need a script that spins up the instance and deletes it. You can also create a snapshot of your disk and recreate your instance from the snapshot, which is a little less expensive than keeping the disk.
Also, you should look at the types of persistent disks (PD) on GCE at this link and take a look at the examples provided, since depending on your operation, regular PD or SSD PD can make a big difference in price.
You can use the pricing calculator to estimate your charges.
When you deploy to App Engine using a Managed VM, a Compute Engine instance (managed by Google) is created for you. All requests to App Engine are forwarded to that Compute Engine instance.
To run your script in App Engine as a Managed VM, you will have to dockerize your project, as the Managed VM runs a Docker container.
I don't see a reason to use an App Engine Managed VM just for running a script, as the cost will be the same as using a Compute Engine instance.
Probably the most cost-effective way is to create a script that:
Launches a Compute Engine instance
Installs Perl
Copies your script to the instance
Runs your script in the created instance
To schedule the execution, you can set up a cron job at home or at the office that executes the above script.
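The steps above can be sketched as a shell function around gcloud. The instance name, zone, and script name are hypothetical placeholders. The final delete step is an addition to the list, assumed here so that billing stops after each run, and apt-get assumes a Debian-based image. A real script would also wait until SSH is reachable after the instance is created.

```shell
#!/usr/bin/env bash
# Sketch: create a VM, run a Perl job on it, then delete the VM.
# INSTANCE, ZONE, and SCRIPT are hypothetical defaults; override them as needed.
INSTANCE="${INSTANCE:-perl-job}"
ZONE="${ZONE:-us-central1-a}"
SCRIPT="${SCRIPT:-job.pl}"

run_job() {
  local status=0

  # 1. Launch a Compute Engine instance.
  gcloud compute instances create "$INSTANCE" --zone="$ZONE" || return 1

  {
    # 2. Install Perl (assumes a Debian-based image).
    gcloud compute ssh "$INSTANCE" --zone="$ZONE" \
      --command="sudo apt-get update && sudo apt-get install -y perl" &&
    # 3. Copy the script to the instance's home directory.
    gcloud compute scp "$SCRIPT" "$INSTANCE:~/" --zone="$ZONE" &&
    # 4. Run the script on the instance.
    gcloud compute ssh "$INSTANCE" --zone="$ZONE" --command="perl ~/$SCRIPT"
  } || status=$?

  # 5. Added step: always delete the instance so it stops accruing charges.
  gcloud compute instances delete "$INSTANCE" --zone="$ZONE" --quiet

  return "$status"
}
```

If the file ends with a run_job call and is saved as, say, /path/to/run_job.sh (a hypothetical path), a crontab entry like 0 * * * * /path/to/run_job.sh would run it at the top of every hour.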