make az cli synapse resume/pause fault tolerant - azure-cli

In order to save money, I'm using az synapse sql pool pause and az synapse sql pool resume, so that the Synapse dedicated pool database only turns on to run tests when there is a Pull Request, then shuts down after.
The challenge is that:
az synapse sql pool resume fails if the database is already resumed, and
az synapse sql pool pause fails if the database is already paused.
Below is example output from when one of the above situations occurs:
Command group 'synapse' is in preview and under development.
Reference and support levels: https://aka.ms/CLI_refstatus
Deployment failed.
Correlation ID: 062ea436-f0f0-4c11-a3e4-4df92cdaf6b5.
An unexpected error occured while processing the request.
Tracking ID: 'ba5ce906-5631-42ad-b3f4-a659095bdbe3'
Exited with code exit status 1
How can I make this command tolerate the state already being achieved?

Here's how you would resume the database if and only if it is currently paused, then wait for it to become "Online". It requires three separate commands:
state=$(az synapse sql pool show \
  --name $DB_NAME \
  --resource-group $RG_NAME \
  --workspace-name $WKSPC_NAME \
  --query "status" \
  --output tsv)  # tsv output strips the JSON quotes so the string comparison works
if [ "$state" = "Paused" ]; then
  echo "Resuming pool!"
  az synapse sql pool resume \
    --name $DB_NAME \
    --resource-group $RG_NAME \
    --workspace-name $WKSPC_NAME
  az synapse sql pool wait \
    --sql-pool-name $DB_NAME \
    --resource-group $RG_NAME \
    --workspace-name $WKSPC_NAME \
    --custom "status=='Online'"  # JMESPath condition on the same 'status' property the show command queries
fi
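The pause side is symmetric. Here is a sketch using the same variables; "Online" and "Paused" are the status values the show command reports, and the wait condition assumes the resource exposes that same status property:
# Pause only if the pool is currently online; exit cleanly either way.
state=$(az synapse sql pool show \
  --name $DB_NAME \
  --resource-group $RG_NAME \
  --workspace-name $WKSPC_NAME \
  --query "status" \
  --output tsv)
if [ "$state" = "Online" ]; then
  echo "Pausing pool!"
  az synapse sql pool pause \
    --name $DB_NAME \
    --resource-group $RG_NAME \
    --workspace-name $WKSPC_NAME
  az synapse sql pool wait \
    --sql-pool-name $DB_NAME \
    --resource-group $RG_NAME \
    --workspace-name $WKSPC_NAME \
    --custom "status=='Paused'"
else
  echo "Pool already paused (status: $state), nothing to do."
fi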

Related

Azure cli - wait for operation to complete before next statement executes

How do I make my CLI script wait until a resource is provisioned before attempting the next operation?
For example, I am creating a WAF policy and then attempting to assign that WAF policy to an app gateway.
The issue is that the WAF policy is still being created when the second command runs:
# WAF Policy
az network application-gateway waf-policy create \
  --name $wafPolicyName \
  --resource-group $resourceGroupName
# App Gateway - won't allow creation without a private IP
az network application-gateway create \
--name $appGatewayName \
--resource-group $resourceGroupName \
--waf-policy $wafPolicyName
This results in an error: Another operation on this or dependent resource is in progress.
How do I make it wait?
Try using this:
az network application-gateway waf-policy wait \
  --name $wafPolicyName \
  --resource-group $resourceGroupName \
  --created
See the reference documentation for how to use az network application-gateway waf-policy wait.
The wait command works perfectly on my side.
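Putting it together with the original script, the sequence becomes (same variables as above):
# WAF Policy
az network application-gateway waf-policy create \
  --name $wafPolicyName \
  --resource-group $resourceGroupName

# Block until the policy finishes provisioning
az network application-gateway waf-policy wait \
  --name $wafPolicyName \
  --resource-group $resourceGroupName \
  --created

# App Gateway can now safely reference the policy
az network application-gateway create \
  --name $appGatewayName \
  --resource-group $resourceGroupName \
  --waf-policy $wafPolicyName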

Docker seems to be killed before my shutdown-script is executed when my VM is preempted on GCE

I want to implement a shutdown script that is called when my VM is about to be preempted on Google Compute Engine. That VM runs Docker containers that execute long-running batches, so I send them a signal to make them exit gracefully.
The shutdown script works well when I execute it manually, yet it breaks in a real preemption scenario, or when I kill the VM myself.
I got this error:
... logs from my containers ...
A 2019-08-13T16:54:07.943153098Z time="2019-08-13T16:54:07Z" level=error msg="error waiting for container: unexpected EOF"
(just after this error, I can see what I put in the 1st line of my shutting-down script, see code below)
A 2019-08-13T16:54:08.093815210Z 2019-08-13 16:54:08: Shutting down! TEST SIGTERM SHUTTING DOWN (this is the 1st line of my shutdown script)
A 2019-08-13T16:54:08.093845375Z docker ps -a
(no result)
A 2019-08-13T16:54:08.155512145Z ps -ef
... a lot of things, but nothing related to docker ...
2019-08-13 16:54:08: task_subscriber not running, shutting down immediately.
I use a preemptible VM on GCE, with image Container-Optimized OS 73-11647.267.0 stable. I run my Docker containers as services with systemctl, yet I don't think this is related - [edit] Actually I could solve my issue thanks to this.
Right now, I am pretty sure that a lot of things happen when Google sends the ACPI signal to my VM, even before the shutdown-script is fetched from the VM metadata and called.
My guess is that all the services are being stopped at the same time, eventually including docker.service itself.
When my container is running, I can reproduce the same level=error msg="error waiting for container: unexpected EOF" with a simple sudo systemctl stop docker.service.
Here is a part of my shutdown script:
#!/bin/bash
# This script must be added in the VM metadata as "shutdown-script" so that
# it is executed when the instance is being preempted.
CONTAINER_NAME="task_subscriber" # For example, "task_subscriber"

logTime() {
  local datetime="$(date +"%Y-%m-%d %T")"
  echo -e "$datetime: $1" # Console
  echo -e "$datetime: $1" >>/var/log/containers/$CONTAINER_NAME.log
}

logTime "Shutting down! TEST SIGTERM SHUTTING DOWN"
echo "docker ps -a" >>/var/log/containers/$CONTAINER_NAME.log
docker ps -a >>/var/log/containers/$CONTAINER_NAME.log
echo "ps -ef" >>/var/log/containers/$CONTAINER_NAME.log
ps -ef >>/var/log/containers/$CONTAINER_NAME.log

if [[ ! "$(docker ps -q -f name=${CONTAINER_NAME})" ]]; then
  logTime "${CONTAINER_NAME} not running, shutting down immediately."
  sleep 10 # Give time to send logs
  exit 0
fi

logTime "Sending SIGTERM to ${CONTAINER_NAME}"
#docker kill --signal=SIGTERM ${CONTAINER_NAME}
systemctl stop taskexecutor.service

# Portable waitpid equivalent
while [[ "$(docker ps -q -f name=${CONTAINER_NAME})" ]]; do
  sleep 1
  logTime "Waiting for ${CONTAINER_NAME} termination"
done

logTime "${CONTAINER_NAME} is done, shutting down."
logTime "TEST SIGTERM SHUTTING DOWN BYE BYE"
sleep 10 # Give time to send logs
If I simply call systemctl stop taskexecutor.service manually (not by actually shutting down the server), the SIGTERM signal is sent to my container and my app properly handles it and exits.
Any idea?
-- How I solved my issue --
I could solve it by adding this dependency on docker in my service config:
[Unit]
Wants=gcr-online.target docker.service
After=gcr-online.target docker.service
I don't know how the magic works behind the execution of the shutdown-script stored in metadata by Google. But I think they should fix something in their Container-Optimized OS so that this magic happens before Docker is stopped. Otherwise, we cannot rely on it to gracefully shut down even a basic script (fortunately I was using systemd here)...
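For reference, one way to apply that dependency without editing the unit file directly is a systemd drop-in. A sketch, assuming the taskexecutor.service name used above:
# Create a drop-in ordering taskexecutor.service after docker.service;
# on shutdown, systemd then stops taskexecutor (and waits for it) before docker.
sudo mkdir -p /etc/systemd/system/taskexecutor.service.d
sudo tee /etc/systemd/system/taskexecutor.service.d/10-docker-dep.conf >/dev/null <<'EOF'
[Unit]
Wants=gcr-online.target docker.service
After=gcr-online.target docker.service
EOF
sudo systemctl daemon-reload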
From the documentation [1], using shutdown scripts on preemptible VM instances is feasible. However, there are some limitations: Compute Engine executes shutdown scripts only on a best-effort basis, and in rare cases cannot guarantee that the shutdown script will complete. Also, preemptible instances get only 30 seconds after preemption begins [2], which might be killing Docker before the shutdown script is executed.
From the error message provided in your use case, it seems to be expected behaviour for Docker containers that run continuously for a long time [3].
[1] https://cloud.google.com/compute/docs/instances/create-start-preemptible-instance#handle_preemption
[2] https://cloud.google.com/compute/docs/shutdownscript#limitations
[3] https://github.com/docker/for-mac/issues/1941

How to make GCE instance stop when its deployed container finishes?

I have a Docker container that performs a single large computation. This computation requires lots of memory and takes about 12 hours to run.
I can create a Google Compute Engine VM of the appropriate size and use the "Deploy a container image to this VM instance" option to run this job perfectly. However once the job is finished the container quits but the VM is still running (and charging).
How can I make the VM exit/stop/delete when the container exits?
When the VM is in this zombie mode, only the Stackdriver containers are left running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
bfa2feb03180 gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1 "/entrypoint.sh /u..." 17 hours ago Up 17 hours stackdriver-logging-agent
161439a487c2 gcr.io/stackdriver-agents/stackdriver-metadata-agent:0.2-0.0.17-2 "/bin/sh -c /opt/s..." 17 hours ago Up 17 hours 8000/tcp stackdriver-metadata-agent
I create the VM like this:
gcloud beta compute --project=abc instances create-with-container vm-name \
--zone=us-central1-c --machine-type=custom-1-65536-ext \
--network=default --network-tier=PREMIUM --metadata=google-logging-enabled=true \
--maintenance-policy=MIGRATE \
--service-account=xyz \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--image=cos-stable-69-10895-71-0 --image-project=cos-cloud --boot-disk-size=10GB \
--boot-disk-type=pd-standard --boot-disk-device-name=vm-name \
--container-image=gcr.io/abc/my-image --container-restart-policy=on-failure \
--container-command=python3 \
--container-arg="a" --container-arg="b" --container-arg="c" \
--labels=container-vm=cos-stable-69-10895-71-0
When you create the VM, you'll need to give it write access to compute so you can delete the instance from within. You should also set container environment variables like gce_zone and gce_project_id at this time. You'll need them to delete the instance.
gcloud beta compute instances create-with-container {NAME} \
--container-env=gce_zone={ZONE},gce_project_id={PROJECT_ID} \
--service-account={SERVICE_ACCOUNT} \
--scopes=https://www.googleapis.com/auth/compute,...
...
Then within the container, whenever YOU determine your task is finished:
Request an API token (I'm using curl for simplicity and the DEFAULT GCE service account):
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"
This will respond with JSON that looks like:
{
  "access_token": "foobarbaz...",
  "expires_in": 1234,
  "token_type": "Bearer"
}
Take that access token and hit the instances.delete API endpoint (notice the environment variables):
curl -XDELETE -H 'Authorization: Bearer {TOKEN}' https://www.googleapis.com/compute/v1/projects/$gce_project_id/zones/$gce_zone/instances/$HOSTNAME
Having grappled with the problem for some time, here's a full solution that works pretty well.
This solution doesn't use the "start machine with a container image" option. Instead it uses a startup script, which is more flexible. You still use a Container-Optimized OS instance.
Create a startup script:
#!/usr/bin/env bash
# get image name and container parameters from the metadata
IMAGE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/image_name -H "Metadata-Flavor: Google")
CONTAINER_PARAM=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/container_param -H "Metadata-Flavor: Google")
# This is needed if you are using a private images in GCP Container Registry
# (possibly also for the gcp log driver?)
sudo HOME=/home/root /usr/bin/docker-credential-gcr configure-docker
# Run! The logs will go to stack driver
sudo HOME=/home/root docker run --log-driver=gcplogs ${IMAGE_NAME} ${CONTAINER_PARAM}
# Get the zone
zoneMetadata=$(curl "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor:Google")
# Split on / and get the 4th element to get the actual zone name
IFS=$'/'
zoneMetadataSplit=($zoneMetadata)
ZONE="${zoneMetadataSplit[3]}"
# Run compute delete on the current instance. Need to run in a container
# because COS machines don't come with gcloud installed
docker run --entrypoint "gcloud" google/cloud-sdk:alpine compute instances delete ${HOSTNAME} --delete-disks=all --zone=${ZONE}
Put the script somewhere public. For example, put it on Cloud Storage and create a public URL. You can't use a gs:// URI for a COS startup script.
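For example (the bucket name is a placeholder), you could upload it with gsutil and make it world-readable:
gsutil cp startup-script.sh gs://my-startup-scripts/startup-script.sh
gsutil acl ch -u AllUsers:R gs://my-startup-scripts/startup-script.sh
# The public URL is then:
# https://storage.googleapis.com/my-startup-scripts/startup-script.sh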
Start an instance using startup-script-url, passing the image name and parameters, e.g.:
gcloud compute --project=PROJECT_NAME instances create INSTANCE_NAME \
--zone=ZONE --machine-type=TYPE \
--metadata=image_name=IMAGE_NAME,\
container_param="PARAM1 PARAM2 PARAM3",\
startup-script-url=PUBLIC_SCRIPT_URL \
--maintenance-policy=MIGRATE --service-account=SERVICE_ACCOUNT \
--scopes=https://www.googleapis.com/auth/cloud-platform --image-family=cos-stable \
--image-project=cos-cloud --boot-disk-size=10GB --boot-disk-device-name=DISK_NAME
(You probably want to limit the scopes; the example uses full access for simplicity.)
I wrote a self-contained Python function based on Vincent's answer.
def kill_vm():
    """
    If we are running inside a GCE VM, kill it.
    """
    # based on https://stackoverflow.com/q/52748332/321772
    import json
    import logging

    import requests

    # get the token
    r = json.loads(
        requests.get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token",
                     headers={"Metadata-Flavor": "Google"})
        .text)
    token = r["access_token"]

    # get instance metadata
    # based on https://cloud.google.com/compute/docs/storing-retrieving-metadata
    project_id = requests.get("http://metadata.google.internal/computeMetadata/v1/project/project-id",
                              headers={"Metadata-Flavor": "Google"}).text
    name = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/name",
                        headers={"Metadata-Flavor": "Google"}).text
    zone_long = requests.get("http://metadata.google.internal/computeMetadata/v1/instance/zone",
                             headers={"Metadata-Flavor": "Google"}).text
    zone = zone_long.split("/")[-1]

    # shut ourselves down
    logging.info("Calling API to delete this VM, {zone}/{name}".format(zone=zone, name=name))
    requests.delete("https://www.googleapis.com/compute/v1/projects/{project_id}/zones/{zone}/instances/{name}"
                    .format(project_id=project_id, zone=zone, name=name),
                    headers={"Authorization": "Bearer {token}".format(token=token)})
A simple atexit hook gets me my desired behavior:
import atexit
atexit.register(kill_vm)
Another solution is to not use GCE and instead use AI Platform's custom job service, which automatically shuts down the VM after the Docker container exits.
gcloud ai-platform jobs submit training $JOB_NAME \
--region $REGION \
--master-image-uri $IMAGE_URI
You can specify --master-machine-type.
See the GCP documentation on custom containers.
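For a memory-heavy job like the one in the question, the submit command might look like this (the machine type shown is only an example):
gcloud ai-platform jobs submit training $JOB_NAME \
  --region $REGION \
  --master-image-uri $IMAGE_URI \
  --scale-tier custom \
  --master-machine-type n1-highmem-8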
The simplest way, from within the container, once it's finished:
ZONE=$(gcloud compute instances list --filter="name=($HOSTNAME)" --format 'csv[no-heading](zone)')
gcloud compute instances delete $HOSTNAME --zone=$ZONE -q
-q skips the interactive confirmation
$HOSTNAME is already exported
Just use curl and the local metadata server (no need for Python scripts or gcloud). Add the following to the end of your Docker Entrypoint script, so it's run when the container finishes:
# Note: inside the container the name is exposed as $HOSTNAME
INSTANCE_NAME=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/name" -H "Metadata-Flavor: Google")
INSTANCE_ZONE=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor: Google")
echo "Terminating instance [${INSTANCE_NAME}] in zone [${INSTANCE_ZONE}}"
TOKEN=$(curl -sq "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google" | jq -r '.access_token')
curl -X DELETE -H "Authorization: Bearer ${TOKEN}" https://www.googleapis.com/compute/v1/$INSTANCE_ZONE/instances/$INSTANCE_NAME
For security's sake, and the Principle of Least Privilege, you can run the VM with a custom service account and give that service account a role with this permission (a custom role is best):
compute.instances.delete
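A sketch of that least-privilege setup with gcloud (the role and account names are placeholders):
# Custom role containing only the self-delete permission
gcloud iam roles create instanceSelfDelete --project=$PROJECT_ID \
  --title="Instance self-delete" \
  --permissions=compute.instances.delete

# Dedicated service account for the VM
gcloud iam service-accounts create vm-self-delete

# Grant the custom role to the service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:vm-self-delete@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="projects/${PROJECT_ID}/roles/instanceSelfDelete"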

Starting with sawtooth xo transaction

I have set up Hyperledger Sawtooth on Docker.
I am trying to test Sawtooth XO transactions with the following commands:
uname@uname:~/sawtooth$ docker exec -it sawtooth-shell-default bash
root@5279e5a413c1:/# xo create one
but I am getting the following error:
Error: Failed to connect to http://127.0.0.1:8008/batches: HTTPConnectionPool(host='127.0.0.1', port=8008): Max retries exceeded with url: /batches (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))
FYI, these commands work for me.
From the shell container, this gives me blocks:
curl http://rest-api:8008/blocks
From my host this works as expected
curl http://localhost:8008/blocks
curl http://127.0.0.1:8008/blocks
What is wrong with this?
My YAML file is the default one; you can find it here.
If you are using Docker, you have to specify the REST API URL this way:
xo create one --url http://rest-api:8008
I found this after Ashish gave this hint at
https://chat.hyperledger.org/channel/sawtooth
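In short, inside the shell container every xo call needs the --url flag pointing at the rest-api service, e.g. (the game name is arbitrary):
xo create mygame --url http://rest-api:8008
xo list --url http://rest-api:8008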

userparameters and ZBX_NOTSUPPORTED

I want to ping an external IP from all of my servers that run the Zabbix agent.
I searched and found some articles about Zabbix user parameters.
In /etc/zabbix/zabbix_agentd.conf.d/ I created a file named userparameter_ping.conf with the following content:
UserParameter=checkip[*],ping -c4 8.8.8.8 && echo 0 || echo 1
I created an item named checkip on the Zabbix server with a graph, but got no data. After some more digging I found zabbix_get and tested my user parameter, but I got the error ZBX_NOTSUPPORTED:
# zabbix_get -s 172.20.4.43 -p 10050 -k checkip
My Zabbix version:
Zabbix Agent (daemon) v2.4.5 (revision 53282) (21 April 2015)
Does anybody know what I can do to address this?
After some changes and talks with folks on the mailing list, it finally worked. Here's how:
First I created a file in:
/etc/zabbix/zabbix_agentd.conf.d/
and added this line:
UserParameter=checkip[*],ping -W1 -c2 $1 >/dev/null 2>&1 && echo 0 || echo 1
Then I ran this command:
./sbin/zabbix_agentd -t checkip["8.8.8.8"]
checkip[8.8.8.8] [t|0]
So everything worked, but the Timeout option turned out to be very important for us:
add a timeout in /etc/zabbix/zabbix_agentd.conf
Timeout=30
The Timeout default is 3 s, so if you run
time ping -W1 -c2 8.8.8.8
you may see that it takes more than 3 s, in which case you get the error:
ZBX_NOTSUPPORTED
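A quick sanity check (this assumes the stock config path and a systemd service named zabbix-agent, which may differ on your distro):
# Time the exact command the item runs; if it can exceed 3 s, raise Timeout.
time ping -W1 -c2 8.8.8.8
# Set Timeout=30 in the agent config, then restart so it takes effect.
sudo sed -i 's/^# Timeout=3$/Timeout=30/' /etc/zabbix/zabbix_agentd.conf
sudo systemctl restart zabbix-agent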
It can be anything: for example a timeout (the default timeout is 3 seconds and ping -c4 requires at least 3 seconds), the permissions/path to ping, an agent that wasn't restarted, ...
Increase the debug level, restart the agent, and check the Zabbix logs. You can also test zabbix_agentd directly:
zabbix_agentd -t checkip[]
[m|ZBX_NOTSUPPORTED] [Timeout while executing a shell script.] => a timeout problem. Edit zabbix_agentd.conf and increase the Timeout setting. The default 3 seconds are not the best for your ping, which needs 3+ seconds.
If you need more than 30 s for the execution, you can use the nohup (command..) & combo to work around the timeout restriction.
That way, if you write the results to a file, on the next pass you can read the file and return the results without any need to wait at all.
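A sketch of that pattern (the script path and item key are hypothetical): a wrapper returns the cached result immediately and refreshes it in the background, so the agent never waits.
#!/bin/sh
# /usr/local/bin/checkip_cached.sh <ip>
# Paired with: UserParameter=checkip_cached[*],/usr/local/bin/checkip_cached.sh $1
IP="$1"
CACHE="/tmp/checkip.${IP}"
# Refresh in the background; nohup detaches the ping from the agent process,
# so the agent's Timeout never applies to it.
nohup sh -c "ping -W1 -c2 ${IP} >/dev/null 2>&1 && echo 0 > ${CACHE} || echo 1 > ${CACHE}" >/dev/null 2>&1 &
# Report the last known result (treat a missing cache as failure).
cat "${CACHE}" 2>/dev/null || echo 1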
For those who may be experiencing other issues with the same error message:
it is important to run zabbix_agentd with the -c parameter:
./sbin/zabbix_agentd -c zabbix_agentd.conf --test checkip["8.8.8.8"]
Otherwise Zabbix might not pick up the command and will thus yield ZBX_NOTSUPPORTED.
It also helps to isolate the command into a script file, as Zabbix will butcher in-line commands in UserParameter= much more than you'd expect.
I defined two user parameters like this for sync checking between two Samba DCs.
/etc/zabbix/zabbix_agentd.d/userparameter_samba.conf:
UserParameter=syncma, sudo samba-tool drs replicate smb1 smb2 cn=schema,cn=configuration,dc=domain,dc=com
UserParameter=syncam, sudo samba-tool drs replicate smb2 smb1 cn=schema,cn=configuration,dc=domain,dc=com
and also provided sudoer access for Zabbix user to execute the command. /etc/sudoers.d/zabbix:
Defaults:zabbix !syslog
Defaults:zabbix !requiretty
zabbix ALL=(ALL) NOPASSWD: /usr/bin/samba-tool
zabbix ALL=(ALL) NOPASSWD: /usr/bin/systemctl
And "EnableRemoteCommands" is enabled on my zabbix_aganetd.conf, sometimes when I run
zabbix_get -s CLIENT_IP -p10050 -k syncma or
zabbix_get -s CLIENT_IP -p10050 -k syncam
I get the error ZBX_NOTSUPPORTED: Timeout while executing a shell script.
But after executing /sbin/zabbix_agentd -t syncam on the client, the Zabbix server just responds normally:
Replicate from smb2 to smb1 was successful.
And when it has a problem, I get the error below in my zabbix.log:
failed to kill [ sudo samba-tool drs replicate smb1 smb2 cn=schema,cn=configuration,dc=domain,dc=com]: [1] Operation not permitted
It seems like a permission error, but it just resolved itself after executing /sbin/zabbix_agentd -t syncam. I am not sure whether the error is gone permanently or will happen again at the next Zabbix item check interval.