How to automatically exit/stop the running instance - google-compute-engine

I have managed to create an instance and ssh into it. However, I have couple of questions regarding the Google Compute Engine.
I understand that I will be charged for the time my instance is running. That is till I exit out of the instance. Is my understanding correct?
I wish to run some batch job (java program) on my instance. How do I make my instance stop automatically after the job is complete (so that I don't get charged for the additional time it may run)
If I start the job and disconnect my PC, will the job continue to run on the instance?
Regards,
Asim

Correct, instances are charged for the time they are running. (to the minute, minimum 10 minutes). Instances run from the time they are started via the API until they are stopped via the API. It doesn't matter if any user is logged in via SSH or not. For most automated use cases users never log in - programs are installed and started via start up scripts.
You can view your running instances via the Cloud Console, to confirm if any are currently running.
If you want to stop your instance from inside the instance, the easiest way is to start the instance with the compute-rw Service Account Scope and use gcutil.
For example, to start your instance from the command line with the compute-rw scope:
$ gcutil --project=<project-id> addinstance <instance name> --service_account_scopes=compute-rw
(this is the default when manually creating an instance via the Cloud Console)
Later, after your batch job completes, you can remove the instance from inside the instance:
$ gcutil deleteinstance -f <instance name>

You can put halt command at the end of your batch script (assuming that you output your results on persistent disk).
After halt the instance will have a state of TERMINATED and you will not be charged.
See https://developers.google.com/compute/docs/pricing
scroll downn to "instance uptime"

You can auto shutdown instance after model training. Just run few extra lines of code after the model training is complete.
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
service = discovery.build('compute', 'v1', credentials=credentials)
# Project ID for this request.
project = 'xyz' # Project ID
# The name of the zone for this request.
zone = 'xyz' # Zone information
# Name of the instance resource to stop.
instance = 'xyz' # instance id
request = service.instances().stop(project=project, zone=zone, instance=instance)
response = request.execute()
add this to your model training script. When the training is complete GCP instance automatically shuts down.
More info on official website:
https://cloud.google.com/compute/docs/reference/rest/v1/instances/stop

If you want to stop the instance using the python script, you can follow this way:
from google.cloud.compute_v1.services.instances import InstancesClient
from google.oauth2 import service_account
instance_client = InstancesClient().from_service_account_file(<location-path>)
zone = <zone>
project = <project>
instance = <instance_id>
instance_client.stop(project=project, instance=instance, zone=zone)
In the above script, I have assumed you are using service-account for authentication. For documentation of libraries used you can go here:
https://googleapis.dev/python/compute/latest/compute_v1/instances.html

Related

Test databases in gitlab ci

I would like to use test databases for feature branches.
Of course it would be best to create a gitlab ci environment on the fly (review apps style) and also create a test database on the target system with the same name. Unfortunately, this is not possible because the MySQL databases in the target system have fixed names, like xxx_1, xxx_2 etc. and this cannot be changed without moving to a different hosting provider.
So I would like to do something like "grab an empty test data base from the given xxx_n and then empty it again when the branch is deleted".
How could this be handled with gitlab ci?
Can I set a variable on the project that says "feature branch Y already uses database xxx_4"?
Or should I put a table into the test database to store this information?
Using dynamic environments/variables and stop jobs might be able to do the trick. Stop jobs will run when the environment is "stopped" -- in the case of feature branches without associated MRs, when the feature branch is deleted (or if there is an open MR for the review app, when the MR is merged or closed)
Can I set a variable on the project that says "feature branch Y already uses database xxx_4"?
One way may be to put the db name directly in the environment name. Then the Environments API keeps track of this.
stages:
- pre-deploy
- deploy
determine_database:
stage: pre-deploy
image: python:3.9-slim
script:
- pip install python-gitlab
- database_name=$(determine-database) # determine what database names are not currently in use
- echo "database_name=${database_name}" > vars.env
artifacts:
reports: # automatically set $database_name variable in subsequent jobs
dotenv: "vars.env"
deploy_review_app:
stage: deploy
environment:
name: review/$CI_COMMIT_REF_SLUG/$database_name
on_stop: teardown
script:
- echo "deploying review app for $CI_COMMIT_REF with database name configuration $database_name"
- ... # steps to actually do the deploy
teardown: # this will trigger when the environment is stopped
stage: deploy
variables:
GIT_STRATEGY: none # ensures this works even if the branch is deleted
when: manual
script:
- echo "tearing down test database $database_name"
- ... # actual script steps to stop env and cleanup database
environment:
name: review/$CI_COMMIT_REF_SLUG/$database_name
action: "stop"
The implementation of the determine-database command may have to connect to your database to determine what database names are available (or perhaps you have a set of these provisioned in advance). You can then inspect the GitLab environments API to see what database names are still in use (since it's baked into the environment name).
For example, you might have something like this. Here, I am using the python-gitlab API wrapper just because it's most familiar to me, but the same principle can be applied to any method of calling the GitLab REST API.
#!/usr/bin/env python3
import gitlab
import os, sys, random
GITLAB_URL = os.environ['CI_SERVER_URL']
PROJECT_TOKEN = os.environ['MY_PROJECT_TOKEN'] # you generate and add this to your CI/CD variables!
PROJECT_ID = os.environ['CI_PROJECT_ID']
DATABASE_NAMES = ['xxx_1', 'xxx_2', 'xxx_3'] # or determine this programmatically by connecting to the DB
gl = gitlab.Gitlab(GITLAB_URL, private_token=PROJECT_TOKEN)
in_use_databases = []
project = gl.projects.get(PROJECT_ID)
for environment in project.environments.list(state='available', all=True):
# the in-use database name is the string after the last '/' in the env name
in_use_db_name = environment.name.split('/')[-1]
in_use_databases.append(in_use_db_name)
available_databases = [name for name in DATABASE_NAMES if name not in in_use_databases]
if not available_databases: # bail if all databases are in use
print('FATAL. no available databases', file=sys.stderr)
raise SystemExit(1)
# otherwise pick one and output to stdout
db_name = random.choice(available_databses)
# optionally you could prepare the database here, too, instead of relying on the `on_stop` job.
print(db_name)
There is a potential concurrency problem here (two runs of determine_database concurrently on different branches can potentially select the same db twice before either finish) but that could be addressed with resource locks.

Startup script from Bitbucket (https) fail to download, but works if instance is reset

I am programatically launching a new instance using the Compute Engine API for Go [1], and a tool I made called vmproxy [2].
The problem I have is that if I launch a preemptible VM using a startup-script-url pointing to https://bitbucket.org/ronoaldo/debian-custom/raw/tip/tools/autobuild, the build script fails to download. I can see in the serial console output that the the startup script metadata is there, and that it attempts to be downloaded with curl, but that part fails.
However, if I reset the instance via the developers console, the script is properly downloaded and runs nicelly.
The code I am using to setup the instance is:
// Ronolinux is a VM Proxy that runs an live systems build on Compute Engine
var (
Ronolinux = &vmproxy.VM{
Path: "/",
Instance: vmproxy.Instance{
Name: "ronolinux-buildd",
Zone: "us-central1-f",
Image: vmproxy.ResourcePrefix + "/debian-cloud/global/images/debian-8-jessie-v20150915",
MachineType: "n1-standard-1",
Metadata: map[string]string{
"startup-script-url": "https://bitbucket.org/ronoaldo/debian-custom/raw/tip/tools/autobuild",
"shutdown-script": `!#/bin/bash
gsutil cp /var/log/startupscript.log gs://ronoaldo/ronolinux/build-$(date +%Y%m%d%H%M%S).log
`,
},
Scopes: []string{ storageReadWrite },
},
}
)
[1] https://godoc.org/google.golang.org/api/compute/v1
[2] https://godoc.org/ronoaldo.gopkg.net/aetools/vmproxy
If your startup script is not hosted on Cloud Storage, there is a random chance the download will fail. If you look at the serial console output, make sure to scroll horizontally, as it will not wrap long lines. In my case, the error line was very long, and this hidded the real end of the message:
(... long curl on-line progress output )
curl: (7) Failed to connect to bitbucket.org port 443: Connection timed out
(...)
Your host must respond within a 10s timeout. In my case, the first boot usually failed to contact Bitbucket, hence failing to download the script; a VM reset also made things work, as the network latency outside Google Cloud were probably better.
I ended up moving to host the script on cloud storage to avoid these issues.

Starting instance again after power off

How do I start instance on GCE again after power off.
Instance shows TERMINATED , but has PERSISTENT disk type.
if I use add instance with the same instance name it asks me for the
Select an new image with only choice of OS level, not my existing disk.
then fails with
ERROR: RESOURCE_ALREADY_EXISTS: The resource XXXX already exists
Is there way to start (or clone) copy of image once stopped?
Anything similar to AWS stop/start. I don't care about instance state or scratch to be saved, just start since I have boot disk stored and payed for.
Success, below is stop/start procedure, assuming that $PROJECT and $INSTANCE are set appropriately:
#--------- stop instance -----
#connect and shutdown
gcutil --project=$PROJECT ssh $INSTANCE
sudo shutdown -h now
# check
gcutil listinstances --project $PROJECT
#delete instance/keep boot disk , use -f to avoid confirmation
gcutil --project=$PROJECT deleteinstance $INSTANCE --nodelete_boot_pd
# check disks
gcutil listdisks --project=$PROJECT
#--------- start new instance -----
# launch instance using the existing disk (has to be in the same zone!)
gcutil --project=$PROJECT addinstance $INSTANCE --disk=$DISK,boot --zone=$ZONE --machine_type=n1-standard-1
#check that it's running
gcutil listinstances --project $PROJECT
You're on the right track. You just need to delete the existing TERMINATED instance before adding it again.
Even though the instance isn't running when it is TERMINATED, the resources (such as Persistent Disk) are still allocated to it.
Also, if this instance was created before December 5th, (when Compute Engine went GA), you'll need to add a kernel to the disk or it won't boot. See the transition guide for details.
(For a temporary work around to upgrading the kernel, see this Q/A: My Google Compute Engine instances hang during boot using the v1 API)

Frequent worker timeout

I have setup gunicorn with 3 workers, 30 worker connections and using eventlet worker class. It is set up behind Nginx. After every few requests, I see this in the logs.
[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514
Why is this happening? How can I figure out what's going wrong?
We had the same problem using Django+nginx+gunicorn. From Gunicorn documentation we have configured the graceful-timeout that made almost no difference.
After some testings, we found the solution, the parameter to configure is: timeout (And not graceful timeout). It works like a clock..
So, Do:
1) open the gunicorn configuration file
2) set the TIMEOUT to what ever you need - the value is in seconds
NUM_WORKERS=3
TIMEOUT=120
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--timeout $TIMEOUT \
--log-level=debug \
--bind=127.0.0.1:9000 \
--pid=$PIDFILE
On Google Cloud
Just add --timeout 90 to entrypoint in app.yaml
entrypoint: gunicorn -b :$PORT main:app --timeout 90
Run Gunicorn with --log-level debug.
It should give you an app stack trace.
Is this endpoint taking too many time?
Maybe you are using flask without asynchronous support, so every request will block the call. To create async support without make difficult, add the gevent worker.
With gevent, a new call will spawn a new thread, and you app will be able to receive more requests
pip install gevent
gunicon .... --worker-class gevent
The Microsoft Azure official documentation for running Flask Apps on Azure App Services (Linux App) states the use of timeout as 600
gunicorn --bind=0.0.0.0 --timeout 600 application:app
https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app
WORKER TIMEOUT means your application cannot response to the request in a defined amount of time. You can set this using gunicorn timeout settings. Some application need more time to response than another.
Another thing that may affect this is choosing the worker type
The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.
When I got the same problem as yours (I was trying to deploy my application using Docker Swarm), I've tried to increase the timeout and using another type of worker class. But all failed.
And then I suddenly realised I was limitting my resource too low for the service inside my compose file. This is the thing slowed down the application in my case
deploy:
replicas: 5
resources:
limits:
cpus: "0.1"
memory: 50M
restart_policy:
condition: on-failure
So I suggest you to check what thing slowing down your application in the first place
Could it be this?
http://docs.gunicorn.org/en/latest/settings.html#timeout
Other possibilities could be your response is taking too long or is stuck waiting.
This worked for me:
gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000
If you have eventlet add:
--worker-class=eventlet
If you have gevent add:
--worker-class=gevent
I've got the same problem in Docker.
In Docker I keep trained LightGBM model + Flask serving requests. As HTTP server I used gunicorn 19.9.0. When I run my code locally on my Mac laptop everything worked just perfect, but when I ran the app in Docker my POST JSON requests were freezing for some time, then gunicorn worker had been failing with [CRITICAL] WORKER TIMEOUT exception.
I tried tons of different approaches, but the only one solved my issue was adding worker_class=gthread.
Here is my complete config:
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3
I had very similar problem, I also tried using "runserver" to see if I could find anything but all I had was a message Killed
So I thought it could be resource problem, and I went ahead to give more RAM to the instance, and it worked.
You need to used an other worker type class an async one like gevent or tornado see this for more explanation :
First explantion :
You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing
Second one :
The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.
If you are using GCP then you have to set workers per instance type.
Link to GCP best practices https://cloud.google.com/appengine/docs/standard/python3/runtime
timeout is a key parameter to this problem.
however it's not suit for me.
i found there is not gunicorn timeout error when i set workers=1.
when i look though my code, i found some socket connect (socket.send & socket.recv) in server init.
socket.recv will block my code and that's why it always timeout when workers>1
hope to give some ideas to the people who have some problem with me
For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml, and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 in the other.
For me, it was because I forgot to setup firewall rule on database server for my Django.
Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean Postgresql database. All I needed to do was add my droplet to the database's "Trusted Sources".
(click on database in DO console, then click on settings. Edit Trusted Sources and select droplet name (click in editable area and it will be suggested to you)).
Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
E.g. if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out in 1 second, and retry 3 times, this gives 10+1*3 ~ 13 seconds, and you can see that it would trigger some times but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.
You can see that adding more workers may help with (or hide) the problem.
The easiest way that worked for me is to create a new config.py file in the same folder where your app.py exists and to put inside it the timeout and all your desired special configuration:
timeout = 999
Then just run the server while pointing to this configuration file
gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app
note that for this statement to work you need wsgi.py also in the same directory having the following
from myproject import app
if __name__ == "__main__":
app.run()
Cheers!
Apart from the gunicorn timeout settings which are already suggested, since you are using nginx in front, you can check if these 2 parameters works, proxy_connect_timeout and proxy_read_timeout which are by default 60 seconds. Can set them like this in your nginx configuration file as,
proxy_connect_timeout 120s;
proxy_read_timeout 120s;
In my case I came across this issue when sending larger(10MB) files to my server. My development server(app.run()) received them no problem but gunicorn could not handle them.
for people who come to the same problem I did. My solution was to send it in chunks like this:
ref / html example, separate large files ref
def upload_to_server():
upload_file_path = location
def read_in_chunks(file_object, chunk_size=524288):
"""Lazy function (generator) to read a file piece by piece.
Default chunk size: 1k."""
while True:
data = file_object.read(chunk_size)
if not data:
break
yield data
with open(upload_file_path, 'rb') as f:
for piece in read_in_chunks(f):
r = requests.post(
url + '/api/set-doc/stream' + '/' + server_file_name,
files={name: piece},
headers={'key': key, 'allow_all': 'true'})
my flask server:
#app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
def api_set_file_streamed(name):
folder = escape(name) # secure_filename(escape(name))
if 'key' in request.headers:
if request.headers['key'] != key:
return 404
else:
return 404
for fn in request.files:
file = request.files[fn]
if fn == '':
print('no file name')
flash('No selected file')
return 'fail'
if file and allowed_file(file.filename):
file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
if not os.path.exists(file_dir_path):
os.makedirs(file_dir_path)
file_path = os.path.join(file_dir_path, secure_filename(file.filename))
with open(file_path, 'ab') as f:
f.write(file.read())
return 'sucess'
return 404
in case you have changed the name of the django project you should also go to
cd /etc/systemd/system/
then
sudo nano gunicorn.service
then verify that at the end of the bind line the application name has been changed to the new application name

SSIS: Accessing a network drive using a different username and password

Is there a way to connect to a network drive that requires a different username/password than the username/password of the user running the package?
I need to copy files from a remote server. Right now I map the network drive in Windows Explorer then do I filesystem task. However, eventually this package will be ran automatically, from a different machine, and will need to map the network drive on its own. Is this possible?
You can use the Execute Process task with the "net use" command to create the mapped drive. Here's how the properties of the task should be set:
Executable: net
Arguments: use \Server\SomeShare YourPassword /user:Domain\YourUser
Any File System tasks following the Execute Process will be able to access the files.
Alternative Method
This Sql Server Select Article covers the steps in details but the basics are:
1) Create a "Execute Process Task" to map the network drive (this maps to the z:)
Executable: cmd.exe
Arguments: /c "NET USE Z: "\\servername\shareddrivename" /user:mydomain\myusername mypassword"
2) Then run a "File System Task" to perform the copy. Remember that the destination "Flat File Connection" must have "DelayValidation" set to True as z:\suchandsuch.csv won't exist at design time.
3) Finally, unmap the drive when you're done with another "Execute Process Task"
Executable: cmd.exe
Arguments: /c "NET USE Z: /delete"
Why not use an FTP task to GET the files over to the local machine? Run SSIS on the local machine. When transferring using FTP in binary, its real fast. Just remember that the ROW delimter for SSIS should be LF, not CRLF, as binary FTp does not convert LF (unix) to CRLF (windows)
You have to map the network drive, here's an example that I'm using now:
profile = "false"
landingPadDir = Dts.Variables("strLandingPadDir").Value.ToString
resultsDir = Dts.Variables("strResultsDir").Value.ToString
user = Dts.Variables("strUserName").Value.ToString
pass = Dts.Variables("strPassword").Value.ToString
driveLetter = Dts.Variables("strDriveLetter").Value.ToString
objNetwork = CreateObject("WScript.Network")
CheckDrive = objNetwork.EnumNetworkDrives()
If CheckDrive.Count > 0 Then
For intcount = 0 To CheckDrive.Count - 1 Step 2 'if drive is already mapped, then disconnect it
If CheckDrive.Item(intcount) = driveLetter Then
objNetwork.RemoveNetworkDrive(driveLetter)
End If
Next
End If
objNetwork.MapNetworkDrive(driveLetter, landingPadDir, profile, user, pass)
From There just use that driveLetter and access the file via the mapped drive.
I'm having one issue (which led me here) with a new script that accesses two share drives and performs some copy/move operations between the drives and I get an error from SSIS that says:
This network connection has files open or requests pending.
at Microsoft.VisualBasic.CompilerServices.LateBinding.InternalLateCall(Object o, Type objType, String name, Object[] args, String[] paramnames, Boolean[] CopyBack, Boolean IgnoreReturn)
at Microsoft.VisualBasic.CompilerServices.NewLateBinding.LateCall(Object Instance, Type Type, String MemberName, Object[] Arguments, String[] ArgumentNames, Type[] TypeArguments, Boolean[] CopyBack, Boolean IgnoreReturn)
at ScriptTask_3c0c366598174ec2b6a217c43470f581.ScriptMain.Main()
This is only on the "2nd run" of the process and if I run it a 3rd time it all works fine so I'm guessing the connection isn't being properly closed or it is not waiting for the copy/move to complete before moving forward or some such, but I'm unable to find a "close" or "flush" command that prevents this error. If you have any solution, please let me know, but the above code should work for getting the drive mapped using your alternate credentials and allow you to access that share.
Zach
To make the package more robust, you can do the following;
In the first Execute Process Task, set - FailTaskIfReturnCodeNotSuccessValue = False
This will let the package run if the last disconnect has not worked.
This is an older question but more recent versions of SQL Server with SSIS databases allow you to use a proxy to execute SQ Server jobs.
In SSMS Under Security<Credentials set up a credential in the database mapped to the AD account you want to use.
Under SQL Server Agent create a new proxy giving it the credential from step 1 and permissions to execute SSIS packages.
Under the SQL Server Agent jobs create a new job that executes your package
Select the step that executes the package and click EDIT. In the Run As dropdown select the Proxy you created in step 2