I have some thousands of VM instances on Google Compute Engine. Almost all of them are stopped. How can I delete all the stopped instances at once?
(doing so on the UI will take ages, and furthermore - the UI crashes)
Thanks!
First get the list of VMs from your project:
gcloud compute instances list | grep TERMINATE
Verify that all these VMs need to be deleted. Then following generates the commands you can execute to delete them all. You can redirect the output to a file and then run "bash ". Feel free to optimize to a single command line if you are feeling lucky :)
gcloud compute instances list | grep TERMINATE | awk '{printf "gcloud comoute instances delete %s --zone %s\n", $1, $2}'
yes Y | gcloud compute instances list | awk '/TERMINATE/ {printf "gcloud compute instances delete %s --zone %s; ", $1, $2}' | bash
gcloud compute instances list : list instances one by one.
awk: prints out "gcloud compute instances delete "terminated_instance_name" --zone "zone name that instance belongs to"
Pipe this output to bash to make it execute on terminal;
And "yes Y" supplies a 'Y' or yes answer when prompted for confirmation.
terminated instances show in the gcloud console as stopped. assuming you wish to delete terminated instances, you can look for instances which have a status of TERMINATED.
the other answers here will work but they will iterate through all of your instances. a slightly cleaner approach is to request a filtered instance list from gcloud so that you only iterate through instances which are already known to be in this state.
finally, i've found that secondary disks don't always delete when their parent instance deletes, so i also like to purge orphaned disks during cleanup.
something like this should do it (for a bash shell):
#!/bin/bash
# delete terminated instances
for terminated_instance_uri in $(gcloud compute instances list --uri --filter="status:TERMINATED" 2> /dev/null); do
terminated_instance_name=${terminated_instance_uri##*/}
terminated_instance_zone_uri=${terminated_instance_uri/\/instances\/${terminated_instance_name}/}
terminated_instance_zone=${terminated_instance_zone_uri##*/}
if [ -n "${terminated_instance_name}" ] && [ -n "${terminated_instance_zone}" ] && gcloud compute instances delete ${terminated_instance_name} --zone ${terminated_instance_zone} --delete-disks all --quiet; then
echo "deleted: ${terminated_instance_zone}/${terminated_instance_name}"
fi
done
# delete orphaned disks (filter for disks without a user)
for orphaned_disk_uri in $(gcloud compute disks list --uri --filter="-users:*" 2> /dev/null); do
orphaned_disk_name=${orphaned_disk_uri##*/}
orphaned_disk_zone_uri=${orphaned_disk_uri/\/disks\/${orphaned_disk_name}/}
orphaned_disk_zone=${orphaned_disk_zone_uri##*/}
if [ -n "${orphaned_disk_name}" ] && [ -n "${orphaned_disk_zone}" ] && gcloud compute disks delete ${orphaned_disk_name} --zone ${orphaned_disk_zone} --quiet; then
echo "deleted: ${orphaned_disk_zone}/${orphaned_disk_name}"
fi
done
Related
I'm trying to script my VM creation and setup process.
Currently the script asks for my ssh passphrase multiple times.
Is there a way to enter the passphrase once at the beginning of the script and be done?
here's the first script:
gcloud -q compute instances create $VM_NAME \
--zone=$ZONE \
--machine-type=n1-standard-1 \
--image-project=ml-images \
--image-family=tf-1-14 \
--scopes=cloud-platform \
--boot-disk-size=24GB \
&& \
echo vm created \
&& \
gcloud -q compute scp --recurse \
~/altered-source/ $VM_NAME:~ \
--zone=$ZONE \
&& \
gcloud -q compute scp --recurse \
~/vm-scripts/ $VM_NAME:~ \
--zone=$ZONE \
&& \
echo files transfered \
&& \
gcloud -q compute ssh $VM_NAME \
--zone=$ZONE
Please provide some details of how you're attempting this.
By default, if you using gcloud, an ssh keypair will be generated automatically for you and stored in the metadata service so that you can ssh seamlessly, e.g.
gcloud compute instances create ${INSTANCE} ...
gcloud compute ssh ${INSTANCE} ... --command=....
Possible a better method to recreate the instance(s) programmatically, is to developer a startup script and then pass this to the instance during creation:
https://cloud.google.com/sdk/gcloud/reference/compute/instances/create#startup-script
Yo!
Definitively, use a Storage Bucket for that intra-transfer thing; you'll see you'll have better control and faster responses all the way.
If you really require to use a "third leg" perhaps using your local machine could work, you just need to install the SDK and use gcloud commands, it wont ask you for keys once you exchanged them between local and remote VMs, the caveat? you rely on your ISP Up/Down speeds, the good thing you know what, and how long is taking a file to upload.
Now, once again I suggest as other here already did, use a cloud bucket, that way you only need to refer your file as gs:///file and forget about the rest.
In any case here is some info about transferring files to instances.
Have an awesome day!
-JP
I have a Docker container that performs a single large computation. This computation requires lots of memory and takes about 12 hours to run.
I can create a Google Compute Engine VM of the appropriate size and use the "Deploy a container image to this VM instance" option to run this job perfectly. However once the job is finished the container quits but the VM is still running (and charging).
How can I make the VM exit/stop/delete when the container exits?
When the VM is in its zombie mode only the stackdriver containers are left running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bfa2feb03180 gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1 "/entrypoint.sh /u..." 17 hours ago Up 17 hours stackdriver-logging-agent
161439a487c2 gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2 "/bin/sh -c /opt/s..." 17 hours ago Up 17 hours 8000/tcp stackdriver-metadata-agent
I create the VM like this:
gcloud beta compute --project=abc instances create-with-container vm-name \
--zone=us-central1-c --machine-type=custom-1-65536-ext \
--network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
--maintenance-policy=MIGRATE \
--service-account=xyz \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
--boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
--container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
--container-command=python3 \
--container-arg="a" --container-arg="b" --container-arg="c" \
--labels=container-vm=cos-stable-69-10895-71-0
When you create the VM, you'll need to give it write access to compute so you can delete the instance from within. You should also set container environment variables like gce_zone and gce_project_id at this time. You'll need them to delete the instance.
gcloud beta compute instances create-with-container {NAME} \
--container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
--service-account={SERVICE_ACCOUNT} \
--scopes=https://www.googleapis.com/auth/compute,...
...
Then within the container, whenever YOU determine your task is finished:
request an api token (im using curl for simplicity and DEFAULT gce service account)
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"
This will respond with json that looks like
{
"access_token": "foobarbaz...",
"expires_in": 1234,
"token_type": "Bearer"
}
Take that access token and hit the instances.delete api endpoint (notice the environment variables)
curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME
Having grappled with the problem for some time, here's a full solution that works pretty well.
This solution doesn't use the "start machine with a container image" option. Instead it uses a startup script, which is more flexible. You still use a Container-Optimized OS instance.
Create a startup script:
#!/usr/bin/env bash
# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")
CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")
# This is needed if you are using a private images in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker
# Run! The logs will go to stack driver
sudo HOME=/home/root docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}
# Get the zone
zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
ZONE="${zoneMetadataSplit[3]}"
# Run compute delete on the current instance. Need to run in a container
# because COS machines don't come with gcloud installed
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME} --delete-disks=all --zone=${ZONE}
Put the script somewhere public. For example put it on Cloud Storage and create a public URL. You can't use a gs:// URI for a COS startup script.
Start an instance using a startup-script-url, and passing the image name and parameters, e.g.:
gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME
(You probably want to limit the scopes, the example uses full access for simplicity)
I wrote a self-contained Python function based on Vincent's answer.
def kill_vm():
"""
If we are running inside a GCE VM, kill it.
"""
# based on https://stackoverflow.com/q/52748332/321772
import json
import logging
import requests
# get the token
r = json.loads(
requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
headers={"Metadata-Flavor": "Google"})
.text)
token = r["access_token"]
# get instance metadata
# based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
headers={"Metadata-Flavor": "Google"}).text
name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
headers={"Metadata-Flavor": "Google"}).text
zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
headers={"Metadata-Flavor": "Google"}).text
zone = zone_long.split("/")[-1]
# shut ourselves down
logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))
requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
.format(project_id=project_id, zone=zone, name=name),
headers={"Authorization": "Bearer {token}".format(token=token)})
A simple atexit hook gets me my desired behavior:
import atexit
atexit.register(kill_vm)
Another solution is to not use GCE and instead use AI Platform's custom job service, which automatically shuts down the VM after the Docker container exits.
gcloud ai-platform jobs submit training $JOB_NAME \
--region $REGION \
--master-image-uri $IMAGE_URI
You can specify --master-machine-type.
See the GCP documentation on custom containers.
The simplest way, from within the container, once it's finished:
ZONE=`gcloud compute instances list --filter="name=($HOSTNAME)" --format 'csv[no-heading](zone)'`
gcloud compute instances delete $HOSTNAME --zone=$ZONE -q
-q skips the interactive confirmation
$HOSTNAME is already exported
Just use curl and the local metadata server (no need for Python scripts or gcloud). Add the following to the end of your Docker Entrypoint script, so it's run when the container finishes:
# Note: inside the container the name is exposed as $HOSTNAME
INSTANCE_NAME=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")
INSTANCE_ZONE=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor: Google")
echo "Terminating instance [${INSTANCE_NAME}] in zone [${INSTANCE_ZONE}}"
TOKEN=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google" | jq -r '.access_token')
curl -X DELETE -H "Authorization: Bearer ${TOKEN}" https://www.googleapis.com/compute/v1/$INSTANCE_ZONE/instances/$INSTANCE_NAME
For security sake, and Principle of Least Privilege, you can run the VM with a custom service account, and give that service account a role, with this permission (a custom role is best).
compute.instances.delete
I am trying to get the first pod from within a deployment (filtered by labels) with status running - currently I could only achieve the following, which will just give me the first pod within a deployment (filtered by labels) - and not for sure a running pod, e.g. it might also be a terminating one:
kubectl get pod -l "app=myapp" -l "tier=webserver" -l "namespace=test"
-o jsonpath="{.items[0].metadata.name}"
How is it possible to
a) get only a pod list of "RUNNING" pods and (couldn't find anything here or on google)
b) select the first one from that list. (thats what I'm currently doing)
Regards
Update: I already tried the link posted in the comments earlier with the following:
kubectl get pod -l "app=ms-bp" -l "tier=webserver" -l "namespace=test"
-o json | jq -r '.items[] | select(.status.phase = "Running") | .items[0].metadata.name'
the result is 4x "null" - there are 4 running pods.
Edit2: Resolved - see comments
Since kubectl 1.9 you have the option to pass a --field-selector argument (see https://github.com/kubernetes/kubernetes/pull/50140). E.g.
kubectl get pod -l app=yourapp --field-selector=status.phase==Running -o jsonpath="{.items[0].metadata.name}"
Note however that for not too old kubectl versions many reasons to find a running pod are mute, because a lot of commands which expect a pod also accept a deployment or service and automatically select a corresponding pod. To quote from the documentation:
$ echo exec logs port-forward | xargs -n1 kubectl help | grep -C1 'service\|deploy\|job'
# Get output from running 'date' command from the first pod of the deployment mydeployment, using the first container by default
kubectl exec deploy/mydeployment -- date
# Get output from running 'date' command from the first pod of the service myservice, using the first container by default
kubectl exec svc/myservice -- date
--
# Return snapshot logs from first container of a job named hello
kubectl logs job/hello
# Return snapshot logs from container nginx-1 of a deployment named nginx
kubectl logs deployment/nginx -c nginx-1
--
Use resource type/name such as deployment/mydeployment to select a pod. Resource type defaults to 'pod' if omitted.
--
# Listen on ports 5000 and 6000 locally, forwarding data to/from ports 5000 and 6000 in a pod selected by the deployment
kubectl port-forward deployment/mydeployment 5000 6000
# Listen on port 8443 locally, forwarding to the targetPort of the service's port named "https" in a pod selected by the service
kubectl port-forward service/myservice 8443:https
(Note logs also accepts a service, even though an example is omitted in the help.)
The selection algorithm favors "active pods" for which a main criterion is having a status of "Running" (see https://github.com/kubernetes/kubectl/blob/2d67b5a3674c9c661bc03bb96cb2c0841ccee90b/pkg/polymorphichelpers/attachablepodforobject.go#L51).
Problem
I'm not sure if the -N argument is being saved. SGE Cluster. Everything works except for the -N argument.
Snakemake requires a valid -N call
It doesn't set the job name properly.
It always reverts to the default name. This is my call, which has the same results, with or without the -N argument.
snakemake --jobs 100 --drmaa "-V -S /bin/bash -o log/mpileup/mpileupSPLIT -e log/mpileup/mpileupSPLIT -l h_vmem=10G -pe ncpus 1 -N {rule}.{wildcards}.varScan"
The only way I have found to influence the job name is to use --jobname.
snakemake --jobs 100 --drmaa "-V -S /bin/bash -o log/mpileup/mpileupSPLIT -e log/mpileup/mpileupSPLIT -l h_vmem=10G -pe ncpus 1 -N {rule}.{wildcards}.varScan" --jobname "{rule}.{wildcards}.{jobid}"
Background
I've tried a variety of things. Usually I actually just use a cluster configuration file, but that isn't working either, so that's why in the code above, I ditched the file system to make sure it's the '-N' command which isn't being saved.
My usual call is:
snakemake --drmaa "{cluster.clusterSpec}" --jobs 10 --cluster-config input/config.json
1) If I use '-n' instead of '-N', I receive a workflow error:
drmaa.errors.DeniedByDrmException: code 17: ERROR! invalid option argument "-n"
2) If I use '-N', but give it an incorrect wildcard, say {rule.name}:
AttributeError: 'str' object has no attribute 'name'
3) I cannot use both --drmaa AND --cluster:
snakemake: error: argument --cluster/-c: not allowed with argument --drmaa
4) If I specify the {jobid} in the config.json file, then Snakemake doesn't know what to do with it.
RuleException in line 13 of /extscratch/clc/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/modules/mpileup/mpileupSPLIT:
NameError: The name 'jobid' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
EDIT Added #5 w/ Solution
5) I can set the job name using the config.json and just concatenate the jobid on afterwards in my snakemake call. That way I have a generic snakemake call (--jobname "{cluster.jobName}.{jobid}"), and a highly configurable and specific job name ({rule}-{wildcards.sampleMPUS}_chr{wildcards.chrMPUS}) which results in:
mpileupSPLIT-Pfeiffer_chr19.1.e7152298
The 1 is the Snakemake jobid according to the DAG.
The 7152298 is my cluster's job number.
2nd EDIT - Just tried v3.12, same thing. Concatenation must occur in snakemake call.
Alternative solution
I would also be okay with something like this:
snakemake --drmaa "{cluster.clusterSpec}" --jobname "{cluster.jobName}" --jobs 10 --cluster-config input/config.json
With my cluster file like this:
"mpileupSPLIT": {
"clusterSpec": "-V -S /bin/bash -o log/mpileup/mpileupSPLIT -e log/mpileup/mpileupSPLIT -l h_vmem=10G -pe ncpus 1 -n {rule}.{wildcards}.varScan",
"jobName": "{rule}-{wildcards.sampleMPUS}_chr{wildcards.chrMPUS}.{jobid}"
}
Documentation Reviewed
I've read the documentation but I was unable to figure it out.
http://snakemake.readthedocs.io/en/latest/executable.html?-highlight=job_name#cluster-execution
http://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#snakefiles-cluster-configuration
https://groups.google.com/forum/#!topic/snakemake/whwYODy_I74
System
Snakemake v3.10.2 (Will try newest conda version tomorrow)
Red Hat Enterprise Linux Server release 5.4
SGE Cluster
Solution
Use '--jobname' in your snakemake call instead of '-N' in your qsub parameter submission
Setup your cluster config file to have a targetable parameter for the jobname suffix. In this case these are the overrides for my Snakemake rule named "mpileupSPLIT":
"mpileupSPLIT": {
"clusterSpec": "-V -S /bin/bash -o log/mpileup/mpileupSPLIT -e log/mpileup/mpileupSPLIT -l h_vmem=10G -pe ncpus 1",
"jobName": "{rule}-{wildcards.sampleMPUS}_chr{wildcards.chrMPUS}"
}
Utilize a generic Snakemake call which includes {jobid}. On a cluster (SGE), the 'jobid' variable contains both the Snakemake Job# and the Cluster Job#, both are valuable as the first corresponds to the Snakemake DAG and the later is for cluster logging. (E.g. --jobname "{cluster.jobName}.{jobid}")
EDIT Added solution to resolve post.
I use the SUN's SGE to submit my jobs into a cluster system. The problem is how to let the
computing machine find the environment variables in the host machine, or how to config the qsub script to make the computing machine load the environment variables in host machine?
The following is an script example, but it will say some errors, such as libraries not found:
#!/bin/bash
#
#$ -V
#$ -cwd
#$ -j y
#$ -o /home/user/jobs_log/$JOB_ID.out
#$ -e /home/user/jobs_log/$JOB_ID.err
#$ -S /bin/bash
#
echo "Starting job: $SGE_TASK_ID"
# Modify this to use the path to matlab for your system
/home/user/Matlab/bin/matlab -nojvm -nodisplay -r matlab_job
echo "Done with job: $SGE_TASK_ID"
The technique you are using (adding a -V) should work. One possibility since you are specifying the shell with -S is that grid engine is configured to launch /bin/bash as a login shell and your profile scripts are stomping all over the environment you are trying to pass to the job.
Try using qstat -xml -j on the job while it is queued/running to see what environment variables grid engine is trying to pass to the job.
Try adding an env command to the script to see what variables are set.
Try adding shopt -q login_shell;echo $? in the script to tell you if it is being run as a login shell.
To list out shells that are configured as login shells in grid engine try:
SGE_SINGLE_LINE=true qconf -sconf|grep ^login_shells
I think this issue is due to you didn't config BASH in the login_shells of SGE
check your login_shells by qconf -sconf and see if bash in there.
login_shells
UNIX command interpreters like the Bourne-Shell (see sh(1)) or the C-
Shell (see csh(1)) can be used by Grid Engine to start job scripts. The
command interpreters can either be started as login-shells (i.e. all
system and user default resource files like .login or .profile will be
executed when the command interpreter is started and the environment
for the job will be set up as if the user has just logged in) or just
for command execution (i.e. only shell specific resource files like
.cshrc will be executed and a minimal default environment is set up by
Grid Engine - see qsub(1)). The parameter login_shells contains a
comma separated list of the executable names of the command inter-
preters to be started as login-shells. Shells in this list are only
started as login shells if the parameter shell_start_mode (see above)
is set to posix_compliant.
Changes to login_shells will take immediate effect. The default for
login_shells is sh,csh,tcsh,ksh.
This value is a global configuration parameter only. It cannot be over-
written by the execution host local configuration.