How to make GCE instance stop when its deployed container finishes? - containers

I have a Docker container that performs a single large computation. This computation requires lots of memory and takes about 12 hours to run.
I can create a Google Compute Engine VM of the appropriate size and use the "Deploy a container image to this VM instance" option to run this job perfectly. However once the job is finished the container quits but the VM is still running (and charging).
How can I make the VM exit/stop/delete when the container exits?
When the VM is in its zombie mode only the stackdriver containers are left running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bfa2feb03180 gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1 "/entrypoint.sh /u..." 17 hours ago Up 17 hours stackdriver-logging-agent
161439a487c2 gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2 "/bin/sh -c /opt/s..." 17 hours ago Up 17 hours 8000/tcp stackdriver-metadata-agent
I create the VM like this:
gcloud beta compute --project=abc instances create-with-container vm-name \
--zone=us-central1-c --machine-type=custom-1-65536-ext \
--network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
--maintenance-policy=MIGRATE \
--service-account=xyz \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
--boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
--container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
--container-command=python3 \
--container-arg="a" --container-arg="b" --container-arg="c" \
--labels=container-vm=cos-stable-69-10895-71-0

When you create the VM, you'll need to give it write access to compute so you can delete the instance from within. You should also set container environment variables like gce_zone and gce_project_id at this time. You'll need them to delete the instance.
gcloud beta compute instances create-with-container {NAME} \
--container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
--service-account={SERVICE_ACCOUNT} \
--scopes=https://www.googleapis.com/auth/compute,...
...
Then within the container, whenever YOU determine your task is finished:
request an api token (im using curl for simplicity and DEFAULT gce service account)
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"
This will respond with json that looks like
{
"access_token": "foobarbaz...",
"expires_in": 1234,
"token_type": "Bearer"
}
Take that access token and hit the instances.delete api endpoint (notice the environment variables)
curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME

Having grappled with the problem for some time, here's a full solution that works pretty well.
This solution doesn't use the "start machine with a container image" option. Instead it uses a startup script, which is more flexible. You still use a Container-Optimized OS instance.
Create a startup script:
#!/usr/bin/env bash
# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")
CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")
# This is needed if you are using a private images in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker
# Run! The logs will go to stack driver
sudo HOME=/home/root docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}
# Get the zone
zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
ZONE="${zoneMetadataSplit[3]}"
# Run compute delete on the current instance. Need to run in a container
# because COS machines don't come with gcloud installed
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME} --delete-disks=all --zone=${ZONE}
Put the script somewhere public. For example put it on Cloud Storage and create a public URL. You can't use a gs:// URI for a COS startup script.
Start an instance using a startup-script-url, and passing the image name and parameters, e.g.:
gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME
(You probably want to limit the scopes, the example uses full access for simplicity)

I wrote a self-contained Python function based on Vincent's answer.
def kill_vm():
"""
If we are running inside a GCE VM, kill it.
"""
# based on https://stackoverflow.com/q/52748332/321772
import json
import logging
import requests
# get the token
r = json.loads(
requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
headers={"Metadata-Flavor": "Google"})
.text)
token = r["access_token"]
# get instance metadata
# based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
headers={"Metadata-Flavor": "Google"}).text
name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
headers={"Metadata-Flavor": "Google"}).text
zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
headers={"Metadata-Flavor": "Google"}).text
zone = zone_long.split("/")[-1]
# shut ourselves down
logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))
requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
.format(project_id=project_id, zone=zone, name=name),
headers={"Authorization": "Bearer {token}".format(token=token)})
A simple atexit hook gets me my desired behavior:
import atexit
atexit.register(kill_vm)

Another solution is to not use GCE and instead use AI Platform's custom job service, which automatically shuts down the VM after the Docker container exits.
gcloud ai-platform jobs submit training $JOB_NAME \
--region $REGION \
--master-image-uri $IMAGE_URI
You can specify --master-machine-type.
See the GCP documentation on custom containers.

The simplest way, from within the container, once it's finished:
ZONE=`gcloud compute instances list --filter="name=($HOSTNAME)" --format 'csv[no-heading](zone)'`
gcloud compute instances delete $HOSTNAME --zone=$ZONE -q
-q skips the interactive confirmation
$HOSTNAME is already exported

Just use curl and the local metadata server (no need for Python scripts or gcloud). Add the following to the end of your Docker Entrypoint script, so it's run when the container finishes:
# Note: inside the container the name is exposed as $HOSTNAME
INSTANCE_NAME=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")
INSTANCE_ZONE=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor: Google")
echo "Terminating instance [${INSTANCE_NAME}] in zone [${INSTANCE_ZONE}}"
TOKEN=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google" | jq -r '.access_token')
curl -X DELETE -H "Authorization: Bearer ${TOKEN}" https://www.googleapis.com/compute/v1/$INSTANCE_ZONE/instances/$INSTANCE_NAME
For security sake, and Principle of Least Privilege, you can run the VM with a custom service account, and give that service account a role, with this permission (a custom role is best).
compute.instances.delete

Related

GCS gcloud too much passphrase

I'm trying to script my VM creation and setup process.
Currently the script asks for my ssh passphrase multiple times.
Is there a way to enter the passphrase once at the beginning of the script and be done?
here's the first script:
gcloud -q compute instances create $VM_NAME \
--zone=$ZONE \
--machine-type=n1-standard-1 \
--image-project=ml-images \
--image-family=tf-1-14 \
--scopes=cloud-platform \
--boot-disk-size=24GB \
&& \
echo vm created \
&& \
gcloud -q compute scp --recurse \
~/altered-source/ $VM_NAME:~ \
--zone=$ZONE \
&& \
gcloud -q compute scp --recurse \
~/vm-scripts/ $VM_NAME:~ \
--zone=$ZONE \
&& \
echo files transfered \
&& \
gcloud -q compute ssh $VM_NAME \
--zone=$ZONE
Please provide some details of how you're attempting this.
By default, if you using gcloud, an ssh keypair will be generated automatically for you and stored in the metadata service so that you can ssh seamlessly, e.g.
gcloud compute instances create ${INSTANCE} ...
gcloud compute ssh ${INSTANCE} ... --command=....
Possible a better method to recreate the instance(s) programmatically, is to developer a startup script and then pass this to the instance during creation:
https://cloud.google.com/sdk/gcloud/reference/compute/instances/create#startup-script
Yo!
Definitively, use a Storage Bucket for that intra-transfer thing; you'll see you'll have better control and faster responses all the way.
If you really require to use a "third leg" perhaps using your local machine could work, you just need to install the SDK and use gcloud commands, it wont ask you for keys once you exchanged them between local and remote VMs, the caveat? you rely on your ISP Up/Down speeds, the good thing you know what, and how long is taking a file to upload.
Now, once again I suggest as other here already did, use a cloud bucket, that way you only need to refer your file as gs:///file and forget about the rest.
In any case here is some info about transferring files to instances.
Have an awesome day!
-JP

jdbc driver does not work in kubernetes, fails with timeout

I have a java 11 app with jdbc driver running together with mysql 8.0, the app is able to connect to mysql and execute one sql, but it looks like it never gets a response back?
It looks like a connectivity issue.
At first it'd be good to look at the Java program output.
First simple checks are at the Kubernetes level to ensure that key components are alive:
$ kubectl get deployments
$ kubectl get services
$ kubectl get pods
Additional checks could be done from within the container where your Java app is running.
A possible approach is below.
List deployments of your app and their labels:
$ kubectl get deployments --show-labels
NAME READY UP-TO-DATE AVAILABLE AGE LABELS
hello-node 2/2 2 2 1h app=hello-node
Having got the label, you can list the relevant pods and their containers:
$ LABEL=hello-node; kubectl get pods -l app=$LABEL -o custom-columns=POD:metadata.name,CONTAINER:spec.containers[*].name
POD CONTAINER
hello-node-55b49fb9f8-7tbh4 hello-node
hello-node-55b49fb9f8-p7wt6 hello-node
Now it's possible to run basic diagnostic commands from within the Java app container.
Ping might not achieve the target but is available almost always in container and does primitive check of DNS resolution.
Services from the same namespace should be available via short DNS name.
Services from other namespaces inside of the same Kubernetes cluster should be available via internal FQDN.
$ kubectl exec hello-node-55b49fb9f8-p7wt6 -c hello-node -- ping -c1 hello-node
$ kubectl exec hello-node-55b49fb9f8-p7wt6 -c hello-node -- ping -c1 hello-node.default.svc.cluster.local
$ kubectl exec hello-node-55b49fb9f8-p7wt6 -c hello-node -- mysql -u [username] -p [dbname] -e [query]
From here on the connectivity diagnostics is pretty similar to the bare-metal server except the fact that you are limited by the tools available inside of container. You might install missing packages into the container as needed.
As soon as you obtain more diagnostic information, you'll get a clue what to check next.

Use short lived token to push docker image to GCP

I have a service account and key json file contents in my process. Trying to spawn "docker gcr.io/my-project/my-image" to upload images to container registry.
I tried docker login -u _json_key -p "$(cat keyfile.json)" https://[HOSTNAME] from Advanced Authentication tutorial, which returns success during login, but still docker push returns error:
You do not currently have an active account selected.
Please run:
$ gcloud auth login
Ideally, I would like to trigger docker push without configuring gcloud SDK. Also would not want to store key json contents to a file. I'd like to keep it in process memory.
The correct command to run for docker clients 18.03 and newer is gcloud auth configure-docker.
If you read the fine print for your command docker login -u _json_key -p "$(cat keyfile.json)" https://[HOSTNAME] it mentions for older Docker clients e.g. several years old. This is not the correct command to run today.
With the constant improvements, new features, Kubernetes, etc. you do not want to be running old commands or configurations.
gcloud auth configure-docker

Cannot set maintenance policy on a instance template from the command line

I have tried to set in a new instances template the maintenance policy to "MIGRATE" and the automatic restart to "On" (as the Web Console does); but it ignores the flags.
This is the command I am using:
gcloud compute instance-templates create \
$TEMPLATE_NAME \
--boot-disk-size 50GB \
--image coreos-beta-681-0-0-v20150527 \
--image-project coreos-cloud \
--machine-type n1-standard-2 \
--metadata-from-file user-data=my-cloud-config.yml \
--scopes compute-rw,storage-full,logging-write \
--tags web-minion \
--maintenance-policy MIGRATE \
--boot-disk-type pd-standard
But the template is created with Automatic restart to "Off" and On host maintenance to "Terminate VM instance". Instances created from this template have also the same settings.
When I log HTTP requests and responses this appears in the create request:
{"automaticRestart": true, "onHostMaintenance": "MIGRATE"}
so it does not seem a client error.
How can I create templates with the same settings the Web Console uses?
EDIT: Version of gcloud: 0.9.61; Version of compute: 2015.05.19.
EDIT 2: This also occurs now in Developers Console; it's a regression because I had a template with the correct values before.
The issue has been fixed and now I can create templates with the correct settings.

Google Compute Engine: how to set hostname permanently?

How do I set the hostname of an instance in GCE permanently? I can set it via hostname,but after reboot it is gone again.
I tried to feed in metadata (hostname:f.q.d.n), but that did not do the job. But it should work via metadata (https://github.com/GoogleCloudPlatform/compute-image-packages/tree/master/google-startup-scripts).
Anybody an idea?
The most simple way to achieve it is creating a simple script and that's what I have done.
I have stored the hostname in the instance metadata and then I retrieve it every time the system restarts in order to set the hostname using a cron job.
$ gcloud compute instances add-metadata <instance> --metadata hostname=<new_hostname>
$ sudo crontab -e
And this is the line that must be appended in crontab
#reboot hostname $(curl --silent "http://metadata.google.internal/computeMetadata/v1/instance/attributes/hostname" -H "Metadata-Flavor: Google")
After these steps, every time you restart your instance it will have the hostname <new_hostname>.
You can check it in the prompt or with the command: hostname
You need to remove the file /etc/dhcp/dhclient.d/google_hostname.sh
rm -rf /etc/dhcp/dhclient.d/google_hostname.sh
rm -rf /etc/dhcp/dhclient-exit-hooks.d/google_set_hostname
It's worth noting that this script is needed in order to run gcloud beta compute instances create with the --hostname flag. If this script is absent on a base image, new VM instances will preserve the source hostname/FQDN!
Edit rc.local
sudo nano /etc/rc.local
Add your line under the rest:
hostname *your.hostname.com*
Make sure to run the following after for the script to be executed
chmod +x /etc/rc.d/rc.local
Reboot, and profit.
That isn't possible. Please take a look at this answer. The following article explains that the "hostname" is part of the default metadata entries and it is not possible to manually edit any of the default metadata pairs. As such, you would need to use a script or something else to change the hostname every time the system restarts, otherwise it will automatically get re-synced with the metadata server on every reboot.
You can find information on startup scripts for GCE in this article. You can visit this one for info on how to apply the script to an instance.
You can also create a simple startup-script to do the jobs:
$ gcloud compute instances add-metadata <instance-name> --zone <instance-zone> --metadata startup-script='#! /bin/bash
hostname <hostname>'
Notice that if you already have a startup-script you need to add to the existing startup-script below command or you will replace all the startup-script:
$ hostname instance-name
I was lucky to set hostname at GCE running CentOS.
Source: desantolo.com
Click EDIT on your instance
Go to "Custom metadata" section
Add hostname + your.hostname.tld (change "your.hostname.tld" to your actual hostname
run curl --silent "http://metadata.google.internal/computeMetadata/v1/instance/attributes/hostname" -H "Metadata-Flavor: Google"
run sudo env EDITOR=nano crontab -e to edit crontab
add line #reboot hostname $(curl --silent "http://metadata.google.internal/computeMetadata/v1/instance/attributes/hostname" -H "Metadata-Flavor: Google")
On your keyboard Ctrl + X
On your keyboard hit Y
On your keyboard hit Enter
run reboot
after system rebooted, run hostname and see if your changes applied
Good luck!
If anyone finds this solution does not work for them on GCS instance. Then I suggest you try using exit hooks as described by Google Support.
In fact, some distributions of Linux like CentOS and Debian use
dhclient-script script to configure the network parameters of the
machine. This script is invoked from time to time by dhclient which is
dynamic host configuration protocol client and provides a means for
configuring one or more network interfaces using the DHCP protocol,
BOOTP protocol, or if these protocols fail, by statically assigning an
address.
The following text is a quote from the man (manual) page of
dhclient-script:
After all processing has completed, /usr/sbin/dhclient-script
checks for the presence of an executable
/etc/dhcp/dhclient-exit-hooks script, which if present is invoked using the ´.´ command. The exit status of
dhclient-script will be passed to dhclient-exit-hooks in the exit_status shell variable, and will always be zero
if the script succeeded at the task for which it was invoked. The rest of the environment as described previ‐
ously for dhclient-enter-hooks is also present. The /etc/dhcp/dhclient-exit-hooks script can modify the valid of
exit_status to change the exit status of dhclient-script.
That being said, by taking a look into the code snippet of
dhclient-script, we can see the script checks for the existence of an
executable /etc/dhcp/dhclient-up-hooks script and all scripts in
/etc/dhcp/dhclient-exit-hooks.d/ directory.
ETCDIR="/etc/dhcp"
193 exit_with_hooks() {
194 exit_status="${1}"
195
196 if [ -x ${ETCDIR}/dhclient-exit-hooks ]; then
197 . ${ETCDIR}/dhclient-exit-hooks
198 fi
199
200 if [ -d ${ETCDIR}/dhclient-exit-hooks.d ]; then
201 for f in ${ETCDIR}/dhclient-exit-hooks.d/*.sh ; do
202 if [ -x ${f} ]; then
203 . ${f}204 fi
205 done
206 fi
207
208 exit ${exit_status}209 }
Therefore, in order to modify the hostname of your Linux VM you can
create a custom script with .sh extension and place it in
/etc/dhcp/dhclient-exit-hooks.d/ directory. If this directory does not
exist, you can create it. The content of the custom script will be:
hostname YourFQDN.sh
>
be sure to make this new .sh file executable:
chmod +x YourFQDN.sh
Source: (https://groups.google.com/d/msg/gce-discussion/olG_nXZ-Jaw/Y9HMl4mlBwAJ)
Im not sure I understand Adrián's answer. It seems overly complex since you have to run a script each boot why not just use hostname?
vi /etc/rc.local
add:
hostname your_hostname
thats it. tested and working. no need to fiddle with metadata and such.
Non-cron/metadata/script solution.
Edit /etc/dhclient-(network-interface).conf or create one if it doesn't exist.
Example:
sudo nano /etc/dhclient-eth0.conf
Then add the following line, replacing the desired FQDN between the double quotes:
supersede host-name "hostname.domain-name";
Persists between reboots and hostname and hostname -f works as intended.
Tested on Debian.
The dhclient sets the hostname using DHCP
You can override this by creating a custom hook script in /etc/dhcp/dhclient-exit-hooks.d/custom_set_hostname that would read the hostname from /etc/hostname:
if [ -f "/etc/hostname" ]; then
new_host_name=$(cat /etc/hostname)
fi
The script must have the execute permission.
It's important to set the new_host_name variable and not calling the hostname command directly as any call to the hostname command will be overriden by another hook or the dhclient-script which uses this variable
When creating a VM, you can specify a custom FQDN hostname as an optional parameter. This feature is currently in Beta.
$ gcloud beta compute instances create INSTANCE_NAME --hostname example.hostname
This should work across OSes, and prevent the need for workaround scripts.
More info in the docs.
-- Sirui (Product Manager, Google Compute Engine)
In my CentOS VMs I found that the script /etc/dhcp/dhclient.d/google_hostname.sh, installed by the google-compute-engine RPM, actually changed the hostname. This happens when the instance gets its IP address during boot.
While it's not the long-term solution I really want, for now I simply deleted this script. The hostname I set with hostnamectl now persists after a reboot.
The script is likely to be in exactly the same place in Debian/Ubuntu VMs, but of course I don't run any of those.
There is some hack you can do to achieve this as i did. Just do:
sudo chattr +i /etc/hosts
This command actually makes the file "(i)mmutable", which means even root can't change it (unless root does chattr -i /etc/hosts first, of course).
As above, you can undo this with sudo chattr -i /etc/hosts
Cheer!
An easy way to fix this is to set up a startup script with custom metadata.
Key :startup-script
Value:
#! /bin/bash
hostname <desired hostname>