Soft restart daemon containers in docker swarm - updates

We use multiple PHP workers, each running in its own container. To scale the number of parallel worker processes, we run them in a Docker swarm.
Each PHP script runs in a loop, waiting for new jobs (fetched from Gearman).
When a job arrives, it is processed; afterwards the script waits for the next job without exiting.
Now we want to update our workers. The image stays the same, but the PHP script changes.
So we have to exit the PHP script, update the script file, and restart the script.
If I use the following docker service update command, Docker stops the container immediately. In the worst case, a running worker is cancelled in the middle of a job.
docker service update --force PHP-worker
Is there any way to restart the Docker container softly?
Soft means giving the container a sign: "I have to do a restart, please cancel all running processes," so that the container has a chance to finish its work.
In my case, before running the next job in the loop, I would check this cancel flag. If the flag is set, I would end the loop and exit the PHP script.
Environment:
Debian: 10
Docker: 19.03.12
PHP: 7.4

In the meantime, we have solved it with signals.
Working with signals in PHP is very easy. In our case, this structure helped us:
//Terminate flag
$terminate = false;

//Register signal handlers
pcntl_async_signals(true);

pcntl_signal(SIGTERM, function () use (&$terminate) {
    echo "Got SIGTERM. Ending worker loop\n";
    $terminate = true;
});

pcntl_signal(SIGHUP, function () use (&$terminate) {
    echo "Got SIGHUP. Ending worker loop\n";
    $terminate = true;
});

//Worker loop
while ($terminate === false) {
    //do next job
}
Before the next job is started, the loop checks whether the terminate flag is set.
Docker has great support for gracefully stopping containers.
To define how long Docker waits before killing the container, we used the stop_grace_period option.
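For reference, a minimal sketch of the relevant compose/stack entry (the service and image names here are placeholders): on docker service update, Docker sends SIGTERM and then waits up to stop_grace_period before sending SIGKILL.
services:
  php-worker:               # hypothetical service name
    image: my-php-worker    # hypothetical image
    stop_grace_period: 2m   # time the worker gets to finish its current job after SIGTERM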

Related

How to kill chromedriver

After running my protractor tests I may be left with chromedriver.exe running. The simple question is: how do I kill it? There are several things to note here:
I cannot just kill based on process name since several other chromedrivers may be running and may be needed by other tests.
I already stop the selenium server using "curl http://localhost:4444/selenium-server/driver/?cmd=shutDownSeleniumServer"
I noticed that the chromedriver is listening on port 33107 (is it possible to specify this port somehow?), but I do not know how I should call it to make it quit.
Probably I should be using driver.quit() in my tests, but on some occasions it might not get called (e.g. when the build is cancelled).
Any ideas on how to kill the proper chromedriver process from the command line (e.g. using curl)?
The proper way to do it is, as you mentioned, by using driver.quit() in your tests.
To be exact, in your test cleanup method, since you want a fresh instance of the browser every time.
Now, the problem with some unit test frameworks (like MSTest, for example) is that if your test initialize method fails, the test cleanup method will not be called.
As a workaround, you can wrap your test initialize statements in a try-catch, with the catch block calling your test cleanup.
public void TestInitialize()
{
    try
    {
        //your test initialize statements
    }
    catch
    {
        TestCleanup();
        //throw the exception, log the error message, or whatever else you need
    }
}

public void TestCleanup()
{
    driver.Quit();
}
EDIT:
For the case when the build is cancelled, you can create a method that kills all open instances of the Chrome browser and ChromeDriver, and execute it before you start a new suite of tests.
E.g. if your Unit Testing Framework used has something similar to Class Initialize or Assembly Initialize you can do it there.
However, on a different post I found this approach:
PORT_NUMBER=1234
lsof -i tcp:${PORT_NUMBER} | awk 'NR!=1 {print $2}' | xargs kill
Breakdown of the command:
(lsof -i tcp:${PORT_NUMBER}) -- list all processes listening on that TCP port
(awk 'NR!=1 {print $2}') -- ignore the first line, print the second column of each line
(xargs kill) -- pass the results as arguments to kill; there may be several
Here, to be more exact: How to find processes based on port and kill them all?
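As for whether the port can be specified: chromedriver accepts a --port flag, so one option (a rough sketch, not Protractor-specific configuration) is to start it on a known port yourself and then reuse the kill-by-port approach above:
PORT_NUMBER=33107
chromedriver --port=${PORT_NUMBER} &
# ... run the tests ...
lsof -i tcp:${PORT_NUMBER} | awk 'NR!=1 {print $2}' | xargs kill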

Running migrations with Rails in a Docker container with multiple container instances

I've seen lots of examples of making Docker containers for Rails applications. Typically they run a rails server and have a CMD that runs migrations/setup then brings up the Rails server.
If I'm spawning 5 of these containers at the same time, how does Rails handle multiple processes trying to initiate the migrations? I can see Rails checking the current schema version in the general query log (it's a MySQL database):
SELECT `schema_migrations`.`version` FROM `schema_migrations`
But I can see a race condition here if this happens at the same time on different Rails instances.
Considering that DDL is not transactional in MySQL and I don't see any locks happening in the general query log while running migrations (other than the per-migration transactions), it would seem that kicking them off in parallel would be a bad idea. In fact if I kick this off three times locally I can see two of the rails instances crashing when trying to create a table because it already exists while the third rails instance completes the migrations happily. If this was a migration that inserted something into the database it would be quite unsafe.
Is it then a better idea to run a single container that runs migrations/setup then spawns (for example) a Unicorn instance which in turn spawns multiple rails workers?
Should I be spawning N rails containers and one 'migration container' that runs the migration then exits?
Is there a better option?
I don't have much experience with Rails specifically, but let's look at it from a Docker and software-engineering point of view.
The Docker team advocates, sometimes quite aggressively, that containers are about shipping applications. In this really great statement, Jerome Petazzoni says that it is all about separation of concerns. I feel that this is exactly the point you already figured out.
Running a rails container which starts a migration or setup might be good for initial deployment and probably often required during development. However, when going into production, you really should consider separating the concerns.
Thus I would say: have one image, use it to run N Rails containers, and add a tools/migration/setup container that you use for administrative tasks. Have a look at what the developers of the official rails image say about this:
It is designed to be used both as a throw away container (mount your source code and start the container to start your app), as well as the base to build other images off of.
When you look at that image, there is no setup or migration command. It is totally up to the user how to use it. So when you need to run several containers, just go ahead.
From my experience with MySQL this works fine. You can run a data-only container to host the data, a container with the MySQL server, and finally a container for administrative tasks like backup and restore. For all three containers you can use the same image. Now you are free to access your database from, let's say, several WordPress containers. This means a clear separation of concerns. When you use docker-compose it is not that difficult to manage all those containers, and there are already many third-party containers and tools to support you in setting up a complex application consisting of several containers.
Finally, you should decide whether Docker and the microservice architecture are right for your problem. As outlined in this article there are some reasons against it, one of the core problems being that it adds a whole new layer of complexity. However, that is the case with many solutions, and I guess you are aware of this and willing to accept it.
docker run <image name> rake db:migrate
This starts your standard application container but doesn't run the CMD (rails server); instead it runs rake db:migrate.
UPDATE: Suggested by Roman, the command would now be:
docker exec <container> rake db:migrate
Having had the same problem deploying to a Docker swarm, here is a solution partially assembled from others.
Rails already has a mechanism to detect concurrent migrations by using a lock on the database, but it raises ConcurrentMigrationError where it should just wait.
One solution is then to have a loop: whenever a ConcurrentMigrationError is raised, just wait 5 seconds and then retry the migration.
It is especially important that all containers perform the migration: if the migration fails, all containers must fail.
Solution from coffejumper
namespace :db do
  namespace :migrate do
    desc 'Run db:migrate and monitor ActiveRecord::ConcurrentMigrationError errors'
    task monitor_concurrent: :environment do
      loop do
        puts 'Invoking Migrations'
        Rake::Task['db:migrate'].reenable
        Rake::Task['db:migrate'].invoke
        puts 'Migrations Successful'
        break
      rescue ActiveRecord::ConcurrentMigrationError
        puts 'Migrations Sleeping 5'
        sleep(5)
      end
    end
  end
end
Sometimes you also have other tasks you want to execute one by one around the migration, like after_party, cron setup, etc. The solution is then to use the same mechanism as Rails and wrap those rake tasks in a database lock:
Below, based on Rails 6 code, migrate_without_lock performs the needed migrations while with_advisory_lock takes the database lock (raising ConcurrentMigrationError if the lock cannot be acquired).
module Swarm
  class Migration
    def migrate
      with_advisory_lock { migrate_without_lock }
    end

    private

    def migrate_without_lock
      puts "Database migration"
      Rake::Task['db:migrate'].invoke
      puts "After_party migration"
      Rake::Task['after_party:run'].invoke
      ...
      puts "Migrations successful"
    end

    def with_advisory_lock
      lock_id = generate_migrator_advisory_lock_id
      MyAdvisoryLockBase.establish_connection(ActiveRecord::Base.connection_config) unless MyAdvisoryLockBase.connected?
      connection = MyAdvisoryLockBase.connection
      got_lock = connection.get_advisory_lock(lock_id)
      raise ActiveRecord::ConcurrentMigrationError unless got_lock
      yield
    ensure
      if got_lock && !connection.release_advisory_lock(lock_id)
        raise ActiveRecord::ConcurrentMigrationError.new(
          ActiveRecord::ConcurrentMigrationError::RELEASE_LOCK_FAILED_MESSAGE
        )
      end
    end

    MIGRATOR_SALT = 1942351734

    def generate_migrator_advisory_lock_id
      db_name_hash = Zlib.crc32(ActiveRecord::Base.connection_config[:database])
      MIGRATOR_SALT * db_name_hash
    end
  end

  # based on rails 6.1 AdvisoryLockBase
  class MyAdvisoryLockBase < ActiveRecord::AdvisoryLockBase # :nodoc:
    self.connection_specification_name = "MyAdvisoryLockBase"
  end
end
Then, as before, loop and wait:
namespace :swarm do
  desc 'Run migrations tasks after acquisition of lock on database'
  task migrate: :environment do
    result = 1
    (1..10).each do |i|
      Swarm::Migration.new.migrate
      puts "Attempt #{i} successfully terminated"
      result = 0
      break
    rescue ActiveRecord::ConcurrentMigrationError
      seconds = rand(3..10)
      puts "Attempt #{i} another migration is running => sleeping #{seconds}s"
      sleep(seconds)
    rescue => e
      puts e
      e.backtrace.each { |m| puts m }
      break
    end
    exit(result)
  end
end
Then in your startup script, just launch the rake task:
set -e
bundle exec rails swarm:migrate
exec bundle exec rails server -b "0.0.0.0"
Finally, since your migration tasks are run by all containers, each task must have a mechanism to do nothing when the work is already done (as db:migrate does).
Using this solution, the order in which Swarm launches containers doesn't matter anymore, AND if something goes wrong, all containers know about the problem :-)
For a single container:
docker exec -it <container ID> bundle exec rails db:migrate
For multiple containers we can repeat the process for each container; if there are a large number of them (say 1000), we would need a script to execute it.
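A rough sketch of such a script (the name filter is a placeholder for however your Rails containers are named, and the concurrency caveats discussed above still apply):
for id in $(docker ps -q --filter "name=rails"); do
  docker exec "$id" bundle exec rails db:migrate
done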

How to automatically exit/stop the running instance

I have managed to create an instance and SSH into it. However, I have a couple of questions regarding Google Compute Engine.
I understand that I will be charged for the time my instance is running, that is, until I exit the instance. Is my understanding correct?
I wish to run a batch job (a Java program) on my instance. How do I make my instance stop automatically after the job is complete (so that I don't get charged for the additional time it may run)?
If I start the job and disconnect my PC, will the job continue to run on the instance?
Regards,
Asim
Correct, instances are charged for the time they are running (to the minute, with a 10-minute minimum). Instances run from the time they are started via the API until they are stopped via the API. It doesn't matter whether any user is logged in via SSH or not; for most automated use cases users never log in - programs are installed and started via startup scripts.
You can view your running instances via the Cloud Console, to confirm if any are currently running.
If you want to stop your instance from inside the instance, the easiest way is to start the instance with the compute-rw Service Account Scope and use gcutil.
For example, to start your instance from the command line with the compute-rw scope:
$ gcutil --project=<project-id> addinstance <instance name> --service_account_scopes=compute-rw
(this is the default when manually creating an instance via the Cloud Console)
Later, after your batch job completes, you can remove the instance from inside the instance:
$ gcutil deleteinstance -f <instance name>
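(gcutil has since been retired in favour of the gcloud CLI; the rough present-day equivalent of that last command would be:)
$ gcloud compute instances delete <instance name> --zone=<zone> --quiet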
You can put the halt command at the end of your batch script (assuming that you write your results to a persistent disk).
After halt, the instance will be in the TERMINATED state and you will not be charged.
See https://developers.google.com/compute/docs/pricing and scroll down to "Instance uptime".
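A minimal sketch of such a batch script (the jar path is a placeholder):
#!/bin/bash
java -jar /opt/jobs/batch-job.jar   # hypothetical batch job
sudo halt                           # stop the instance once the job is done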
You can shut the instance down automatically after model training. Just run a few extra lines of code after the training is complete.
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
service = discovery.build('compute', 'v1', credentials=credentials)
# Project ID for this request.
project = 'xyz' # Project ID
# The name of the zone for this request.
zone = 'xyz' # Zone information
# Name of the instance resource to stop.
instance = 'xyz' # instance id
request = service.instances().stop(project=project, zone=zone, instance=instance)
response = request.execute()
Add this to your model training script. When the training is complete, the GCP instance automatically shuts down.
More info on official website:
https://cloud.google.com/compute/docs/reference/rest/v1/instances/stop
If you want to stop the instance using a Python script, you can do it this way:
from google.cloud.compute_v1.services.instances import InstancesClient
from google.oauth2 import service_account
instance_client = InstancesClient.from_service_account_file(<location-path>)
zone = <zone>
project = <project>
instance = <instance_id>
instance_client.stop(project=project, instance=instance, zone=zone)
In the above script, I have assumed you are using a service account for authentication. For documentation of the libraries used, see here:
https://googleapis.dev/python/compute/latest/compute_v1/instances.html

Frequent worker timeout

I have set up gunicorn with 3 workers, 30 worker connections, and the eventlet worker class. It is set up behind Nginx. After every few requests, I see this in the logs:
[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514
Why is this happening? How can I figure out what's going wrong?
We had the same problem using Django + nginx + gunicorn. Following the Gunicorn documentation, we configured graceful-timeout, which made almost no difference.
After some testing, we found the solution. The parameter to configure is timeout (and not graceful timeout). It works like a clock.
So, do:
1) open the gunicorn configuration file
2) set TIMEOUT to whatever you need - the value is in seconds
NUM_WORKERS=3
TIMEOUT=120
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--timeout $TIMEOUT \
--log-level=debug \
--bind=127.0.0.1:9000 \
--pid=$PIDFILE
On Google Cloud
Just add --timeout 90 to entrypoint in app.yaml
entrypoint: gunicorn -b :$PORT main:app --timeout 90
Run Gunicorn with --log-level debug.
It should give you an app stack trace.
Is this endpoint taking too much time?
Maybe you are using Flask without asynchronous support, so every request will block the call. To add async support without much difficulty, use the gevent worker.
With gevent, a new call will spawn a new greenlet, and your app will be able to receive more requests:
pip install gevent
gunicorn .... --worker-class gevent
The official Microsoft Azure documentation for running Flask apps on Azure App Service (Linux) states the use of a timeout of 600:
gunicorn --bind=0.0.0.0 --timeout 600 application:app
https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app
WORKER TIMEOUT means your application cannot respond to the request in a defined amount of time. You can set this using gunicorn's timeout setting. Some applications need more time to respond than others.
Another thing that may affect this is the choice of worker type:
The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.
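Putting the two together, a typical invocation might look roughly like this (the module and app names are placeholders, and gevent must be installed):
gunicorn myproject:app --workers 3 --worker-class gevent --timeout 120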
When I got the same problem as yours (I was trying to deploy my application using Docker Swarm), I tried increasing the timeout and using another worker class, but none of it worked.
Then I suddenly realised I had set the resource limits too low for the service inside my compose file. This is what slowed down the application in my case:
deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure
So I suggest you first check what is slowing down your application.
Could it be this?
http://docs.gunicorn.org/en/latest/settings.html#timeout
Other possibilities could be your response is taking too long or is stuck waiting.
This worked for me:
gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000
If you have eventlet add:
--worker-class=eventlet
If you have gevent add:
--worker-class=gevent
I've got the same problem in Docker.
In Docker I keep a trained LightGBM model + Flask serving requests. As the HTTP server I used gunicorn 19.9.0. When I ran my code locally on my Mac laptop everything worked perfectly, but when I ran the app in Docker my POST JSON requests froze for some time, and then the gunicorn worker failed with a [CRITICAL] WORKER TIMEOUT exception.
I tried tons of different approaches, but the only one that solved my issue was adding worker_class = "gthread".
Here is my complete config:
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3
I had a very similar problem. I also tried using "runserver" to see if I could find anything, but all I got was a message saying Killed.
So I thought it could be a resource problem, and I went ahead and gave more RAM to the instance, and it worked.
You need to use another worker class, an async one like gevent or tornado. See the following for more explanation:
First explanation:
You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing
Second one:
The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.
If you are using GCP then you have to set workers per instance type.
Link to GCP best practices https://cloud.google.com/appengine/docs/standard/python3/runtime
timeout is a key parameter to this problem.
However, it didn't suit my case.
I found there was no gunicorn timeout error when I set workers=1.
When I looked through my code, I found some socket calls (socket.send & socket.recv) in the server init.
socket.recv blocks the code, and that's why it always timed out when workers>1.
Hope this gives some ideas to people who have the same problem as me.
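For illustration only (a sketch, not the poster's fix): one way to keep a blocking recv from hanging a synchronous worker is to put a timeout on the socket.
import socket

# connect with a timeout, and make recv/send raise socket.timeout instead of blocking forever
sock = socket.create_connection(("example.com", 80), timeout=5)  # hypothetical peer
sock.settimeout(5)
try:
    data = sock.recv(4096)
except socket.timeout:
    data = b""  # handle the unresponsive peer instead of blocking the worker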
For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 to the other.
For me, it was because I forgot to set up a firewall rule on the database server for my Django app.
Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean Postgresql database. All I needed to do was add my droplet to the database's "Trusted Sources".
(click on database in DO console, then click on settings. Edit Trusted Sources and select droplet name (click in editable area and it will be suggested to you)).
Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
E.g. if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out in 1 second, and retry 3 times, this gives 10+1*3 ~ 13 seconds, and you can see that it would trigger some times but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.
You can see that adding more workers may help with (or hide) the problem.
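As an illustration (assuming a Kubernetes-style liveness probe; the endpoint, port, and values are placeholders), the idea is to give the probe more slack than your slowest expected request:
livenessProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3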
The easiest way that worked for me is to create a new config.py file in the same folder as your app.py and put the timeout and any other configuration you need inside it:
timeout = 999
Then just run the server while pointing to this configuration file
gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app
Note that for this to work you also need wsgi.py in the same directory, containing the following:
from myproject import app
if __name__ == "__main__":
app.run()
Cheers!
Apart from the gunicorn timeout settings already suggested, since you are using nginx in front, you can check whether these two parameters help: proxy_connect_timeout and proxy_read_timeout, which default to 60 seconds. You can set them in your nginx configuration file like this:
proxy_connect_timeout 120s;
proxy_read_timeout 120s;
In my case I came across this issue when sending larger (10 MB) files to my server. My development server (app.run()) received them with no problem, but gunicorn could not handle them.
For people who run into the same problem I did: my solution was to send the file in chunks, like this:
def upload_to_server():
    upload_file_path = location

    def read_in_chunks(file_object, chunk_size=524288):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 512 KB."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    with open(upload_file_path, 'rb') as f:
        for piece in read_in_chunks(f):
            r = requests.post(
                url + '/api/set-doc/stream' + '/' + server_file_name,
                files={name: piece},
                headers={'key': key, 'allow_all': 'true'})
My Flask server:
@app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
def api_set_file_streamed(name):
    # app, key, allowed_file(), escape() and secure_filename() come from the rest of the application
    folder = escape(name)  # secure_filename(escape(name))
    if 'key' in request.headers:
        if request.headers['key'] != key:
            return '', 404
    else:
        return '', 404
    for fn in request.files:
        file = request.files[fn]
        if fn == '':
            print('no file name')
            flash('No selected file')
            return 'fail'
        if file and allowed_file(file.filename):
            file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
            if not os.path.exists(file_dir_path):
                os.makedirs(file_dir_path)
            file_path = os.path.join(file_dir_path, secure_filename(file.filename))
            with open(file_path, 'ab') as f:
                f.write(file.read())
            return 'success'
    return '', 404
In case you have changed the name of your Django project, you should also go to
cd /etc/systemd/system/
then
sudo nano gunicorn.service
and verify that at the end of the bind line the application name has been changed to the new application name.
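For orientation, the relevant line in such a gunicorn.service unit usually looks something like this (the paths and project name are placeholders):
ExecStart=/home/user/venv/bin/gunicorn \
          --workers 3 \
          --bind unix:/run/gunicorn.sock \
          newproject.wsgi:application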

How do I PSRemote start procdump such that it persists after the session ends

I can start a persistent process on unix with:
nohup process &
It will continue to run after I close my bash session. I cannot seem to do the same with PowerShell remoting on Windows. I can open a PSRemote session with a server and start a process, but as soon as I close that session, the process dies. My assumption is that this is a benefit of strong sandboxing, but it's a benefit I'd rather work around somehow. Any ideas?
So far I've tried:
$exe ='d:\procdump.exe'
$processArgs = '-ma -e -t -n 3 -accepteula w3wp.exe d:\Dumps'
1) [System.Diagnostics.Process]::Start($exe,$processArgs)
2) Start-Job -ScriptBlock {param($exe,$processArgs) [System.Diagnostics.Process]::Start($exe,$processArgs)} -ArgumentList ($exe,$processArgs)
3) start powershell {param($exe ='d:\procdump.exe', $processArgs = '-ma -e -t -n 3 -accepteula w3wp.exe d:\Dumps') [System.Diagnostics.Process]::Start($exe,$processArgs)}
4) start powershell {param($exe ='d:\procdump.exe', $processArgs = '-ma -e -t -n 3 -accepteula w3wp.exe d:\Dumps') Start-Job -ScriptBlock {param($exe,$processArgs) [System.Diagnostics.Process]::Start($exe,$processArgs)} -ArgumentList ($exe,$processArgs)}
The program runs right up until I close the session, and then procdump is reaped. The coolest thing about procdump is that it will self-terminate, and I'd like to leave it running to take advantage of that fact.
I'd been starting ADPlus remotely, holding a session open, and just terminating the session to kill the captures. That's kind of handy, but it requires an awful lot of polling, inspecting, and deciding when the right moment is to kill the capture process - before filling up the hard drive but after capturing enough dumps to be useful. I can leave procdump running indefinitely while it waits for an appropriate trigger, and when it has captured enough data it will just die. That's lovely.
I just need to get procdump to keep running after I terminate my remote session. It's probably not worth creating a procdump scheduled task and starting it, but that's about the last idea I've got left.
Thanks.
This is not directly possible. Indirectly, yes: a task or a service could be created and started remotely, but simply pushing a process off into the SYSTEM space is not.
I resolved my issue by spawning a local job that starts the remote job and remains alive for the required period of time. The local job holds the remote session open and then dies at the appropriate time, while the parent local process continues to run uninterrupted and can harvest the return value of the remote procdump with Receive-Job if I happen to care.
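A rough sketch of that approach (the server name, procdump arguments, and timeout are placeholders):
$job = Start-Job -ScriptBlock {
    Invoke-Command -ComputerName MYSERVER -ScriptBlock {
        & 'd:\procdump.exe' -ma -e -t -n 3 -accepteula w3wp.exe d:\Dumps
    }
}
# The parent script carries on; the background job keeps the remote session alive.
Wait-Job $job -Timeout 3600 | Out-Null   # give procdump up to an hour to finish
Receive-Job $job                         # collect any output from the remote procdump
Remove-Job $job -Force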