NVIDIA Tesla K80 VM instance won't start on Google Cloud Compute - google-compute-engine

I created a VM instance on Google Cloud Compute attaching a NVIDIA Tesla K80 and using SSD for persistent storage.
I'm running Ubuntu 16.04 LTS on it and stopped the instance to prevent billing during no-usage time. Now I try to start the instance again but it won't start, neither from Console nor from Terminal (macOS).
I have already tried to view the instance's console port log, but it's not available as the instance is not running.

I would suggest checking the preemptibility of your instance and GPU. You can check whether your GPU is preemptive or not from the quota page. You can check the preemptibility of your instance by clicking on the instance and finding out whether Availability policy > Preemptibility is off or on, or by following this document.
Keep in mind, preemptible GPUs will only work on a preemptible instance.
If you find out that both the preemptibility matches, then it might be a project specific issue which will require one to one investigation. To get this support, you can open an issue in the public issue tracker so someone from Google can assist you.

Related

Why do my google cloud compute instances always unexpectedly restart?

Help! Help! Help!
It is really annoying and I almost cannot bear it anymore! I'm using google cloud compute engine instances but they often unexpectedly restart without any notification in advance. The restart of instances seems to happen randomly and I have no idea what's going wrong there! I'm pretty sure that the instances are been occupied (usage of CPUs > 50% and all GPUs are in use) when restart happens. Could anyone please tell me how to solve this problem? Thanks in advance!
The issue is right here:
all GPUs are in use
If you check the official documentation about GPU:
GPU instances must terminate for host maintenance events, but can automatically restart. These maintenance events typically occur once per week, but can occur more frequently when necessary. You must configure your workloads to handle these maintenance events cleanly. Specifically, long-running workloads like machine learning and high-performance computing (HPC) must handle the interruption of host maintenance events. Learn how to handle host maintenance events on instances with GPUs.
This is because an instance that has a GPU attached cannot be migrated to another host for maintenance as it happens for the rest of the virtual machines. To get a physical GPU attached to the instance and bare metal performance you are using GPU passthrough , which sadly means if the host has to go through maintenance the VM is going down with it.
This sounds like Preemptible VM instance.
Preemptible instances function like normal instances, but have the following limitations:
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
Compute Engine always terminates preemptible instances after they run for 24 hours.
To check if your instance is preemptible using gcloud cli, just run
gcloud compute instances describe instance-name --format="(scheduling.preemptible)"
Result
scheduling:
preemptible: false
change "instance-name" to real name.
Or simply via UI, click on compute instance and scroll down:
To check for system operations performed on your instance, you can review it using following command:
gcloud compute operations list

Instance is overutilized. Consider switching to the machine type: g1-small

I created a new f1 micro instance with Ubuntu 16.04. I haven't logged in yet as I have not figured out how to create the SSH key-pair yet. But after two days, the Dashboard now shows:
Instance "xxx" is overutilized. Consider switching to the machine type: g1-small
Why is this happening? Isn't a f1 micro similar to an ec2 t1.nano? I have a t1.nano running a Node.js web site (with nginx, pm2, etc) and my CPU credit has been consistently at the maximum of 150 during this period with only me as a test user.
I started the f1 micro to run the same Node application to see which is more cost-effective. The parameter that was cloudy to me was that unexplained "0.2 virtual CPU". Is 0.2 CPU virtually unuseable? Would 0.5 (g1 small) be significantly better?
To address your connection problems, perhaps temporarily until you figure out the manual key management, you might want to try SSH from the browser which is possible from the Cloud Platform console or use gcloud CLI to assist you.
https://cloud.google.com/compute/docs/instances/connecting-to-instance
Once you get access via the terminal I would run 'top' or 'ps'.
Example of using ps to find the top CPU users:
ps wwaxr -o pid,stat,%cpu,time,command | head -10
Example of running top to find the top memory users:
top -l 1 -o rsize | head -20
Google Cloud also offers a monitoring product called Stackdriver which would give you this information in the Cloud console but it requires an agent to be running on your VM. See the getting started guide if this sounds like a good option for you.
https://cloud.google.com/monitoring/quickstart-lamp
Once you get access to the resource usage data you should be able to determine if 1) the VM isn't powerful enough to run your node.js server or 2) perhaps something else got started on the host unexpectedly and that's the source of your usage.

Having hard time to start GCE VM with GPU in us-east-1

We are having problem to start a simple GCE VM with GPU in us-central-1. I am wondering if anyone experience the same thing. The error we got is below:
Instance 'instance-group-2-vc37' creation failed: The resource 'projects/xxxxx-xxxx-858/zones/us-central1-a/acceleratorTypes/nvidia-tesla-k80' was not found (when acting as 'xxxxxxx#cloudservices.gserviceaccount.com')
Thanks
GCE doesn't offer GPUs in us-central1. The docs list which regions GPUs are available in.
Cloud ML Engine is a separate product and not what you are using here.

EC2 Instance is running very slow

I am running an EC2 Instance on Ubuntu Server machine. Tomcat and MySQL are installed and deployed java web-application on it since 1 month. It was running good with great performance for almost 1 month but now my application is responding very slow.
Also, point to note is: Earlier when I used to log into my Ubuntu Server through PuTTY, it was quick but now its taking time even when I enter Ubuntu password.
Is there any solution?
I would start with checking with memory/CPU/network availability to check if it is not bottleneck.
Try following commands:
To check memory availability:
free -m
To check CPU usage:
top
To check network usage:
ntop
To check disk usage:
df -h
To check disk io operations:
iotop
Please also check if when you disable your application you are able to quickly log in to that machine. If login is still slow, then you should contact your EC2 support complaining about poor performance and asking for assigning more resources for that machine.
You can use WAIT Tool to diagnose what is wrong with your server or your application. The tool will gather all information about CPU and memoru utilization, running threads etc.
In addition, I would definitely check Tomcat application server with VisualVM or some other profiler. For configuring JMX for Tomcat you can check article here.
For network monitoring - nload tool is worth your attention. You can launch it in screen so you always check network utilization stats when server is slown.
First check is there any application using too much cpu or memory. This can be checked by using top command. I'll tell you two simple shortcut keys that may be helpful while using top command. In top command result page, if you enter M it will sort application based on memory usage, from highest to lowest. If you enter P it will sort application based on cpu usage, from highest to lowest.
If you are unable to find any suspicious application using top you can use iotop it will show disk I/O usage details.
I was facing the same issue, the solution which worked for me was
Restart the ec2 instance
Edit
lately, I figure out this issue is happening due to the fewer resources (memory, CPU) available to the EC2 machine. So check available resources to the EC2 machine.

Get Memory and Cpu Usages in Google Cloud Compute

How is one supposed to access the CPU usages and Memory usages of all the instances in a given project in Google Cloud Compute?
I'm unable to find anything regarding this in the documentation.
You can use Google Cloud Monitoring to see some usage metrics for your systems, and the Google Cloud Monitoring agent to get more precise metrics like memory. See the GCP metrics documentation for a list of all available compute metrics.
For memory usage on Debian:
free -m
in console.
You can CPU, Memory, Network, Disk I/O related information of your instance group into Google Stackdriver. Stackdriver comes with a separate subscription. You can add charts to monitoring your GCP infrastructure into one single/multiple dashboard/s.
You can see detail information in How to monitor GCP infrastructure using stack driver visualization
if you are using Linux, you can install gnome-system-monitor and ssh -X into the system and launch it.
For Ubuntu on GCP:
"apt-get install gnome-system-monitor dbus"
From another linux machine (or if you have cygwin/x installed on windows), just "ssh -X {remote ip}" Then type gnome-system-monitor and it will launch on your desktop.
You could also setup a vncserver on the cloud platform.