Headless servers Opengym AI rendering Error while using ray

Headless servers Opengym AI rendering Error while using ray - reinforcement-learning

While using ray for distributed computation, all the servers are headless (no display). Therefore, using "xvfb-run -s “-screen 0 1400x900x24” to create screen.
Getting error
pyglet.canvas.xlib.NoSuchDisplayException: Cannot connect to “None”
Without ray using only 1 machine, this command works perfectly.
"xvfb-run -s “-screen 0 1400x900x24”
In conclusion, xvfb-run doesn’t work with ray distribution.
Does Ray require extra configuration to achieve this? Is there any other way I can get past this error? I am working on a Car Racing environment from open gym ai which triggers rendering.

I stumbled upon a similar problem, although running a Python script, but also with the CarRacinv-v0 environment.
What worked for me was this
import gym
import pyvirtualdisplay
# Creates a virtual display for OpenAI gym
pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
env = gym.make('CarRacing-v0')
env.reset() # This line failed without the display setup

Related

ARM At91 CPU startup in qemu

ARM AT91 can not startup in QEMU. I can't get any print on the console.
I am trying to use QEMU(latest code pulled by git) to simulate an ARM AT91 board. But when startup the QEMU, I got no print in the console. In my understanding, there would be two steps to achieve this:
1, Property setup with the memory address in QEMU, let the QEMU decompress zImage. In this step, I will see "Uncompressing Linux...done, booting the kernel."
2, Property setup the output device(eg: uart0). I will get the kernel startup message.
I've succeeded in starting up with the ARM versatilePB because the QEMU supports versatilePB itself. The difference between versatilePB and AT91 is they have different SDRAM address. I've tried to modify loader_start to 0x20000000 but it seems still not work.
hwaddr loader_start;//0x2000000, which is AT91 SDRAM address
memory_region_add_subregion(sysmem, 0x2000000, ram);
At least it should print Uncompressing Linux...done, booting the kernel., which indicates that the zImage is executed and decompressed.

QEMU (at least upstream QEMU) does not have a model of the AT91 SoCs. The differences between these systems and ones like the versatilePB that QEMU does support are greater than just "the RAM is at a different address" -- they will have different devices of all kinds (including the UART) which both behave differently and are found at different locations. It is impossible to run bare metal code intended for an AT91 without implementing in QEMU a model of the correct board and at least some of the AT91 devices. The changes required would be much much more substantial than just changing a few addresses for the RAM base address.

Hyperledger Composer CLI Ping to a Business Network returns AccessException

Im trying to learn Hyperledger Composer but seems to be a relatively new technology, i mean there are few tutorials and few solutions to a lot of questions, tutorial does not mention possible error case when following the commands and which means there are is also no solution for those errors.
I have joined the composer channel in their community chat, looks like its running in Discord or something, and asked the same question without a response, i have a better experience here in SO.
This is the problem: I have deployed my business network, installed it, started it, created my network admin card and imported it, then to test if everything is ok i have to command composer network ping --card NAME-OF-MY-ADMIN-CARD
And this error comes:
juan#JuanDeDios:~/proyectos/inovacion/a3-poliza-microservice$ composer network ping --card admin#a3-policy-microservice
Error: transaction returned with failure: AccessException: Participant 'org.hyperledger.composer.system.NetworkAdmin#admin' does not have 'READ' access to resource 'org.hyperledger.composer.system.Network#a3-policy-microservice#0.0.1'
Command failed
I think that it has to do something with the permission.acl file, and gave permission to everyone to everything so there would not be any restrictions to anyone, and tryied again, but failed.
So i thought i had to uninstall my business network and create it again, i deleted my .bna and my network.card files also so everything would be created again, but the same error result.
My other attempt was to update the business network, but didn't work, the same error happened and I'm sure i didn't miss any step from the tutorial. I do also followed the playground tutorial. What i have not done its to create another app with the Yeoman but i will do if i don't find a solution to this problem which would not require me to create another app.
This were my steps:
1-. Created my app with Yeoman
yo hyperledger-composer:businessnetwork
2-. Selected Apache-2.0 for my license
3-. Created a3-policy-microservice as the name of the business network
4-. Created org.microservice.policy (Yeah i switched names but Im totally aware)
5-. Generated my app with a template selecting the NO option
6-. Created my assets, participants and transactions
7-. Changed my permission rules to mine
8-. I generated the .bna file
composer archive create -t dir -n .
9-. Then installed my bna file
composer network install --card PeerAdmin#hlfv1 --archiveFile a3-policy-microservice#0.0.1.bna
10-. Then started my network and created my networkadmin card
composer network start --networkName a3-policy-network --networkVersion 0.0.1 --networkAdmin admin --networkAdminEnrollSecret adminpw --card PeerAdmin#hlfv1 --file networkadmin.card
11-. Imported my card
composer card import --file networkadmin.card
12-. Tried to ping my network
composer network ping --card admin#a3-poliza-microservice
And the error happens
Later i tried to create everything again shutting down my fabric and started it again and creating the network from the first step.
My other attempt was to change the permissions and upgrade my bna network, but it failed too. Im running out of options
Hope this description its not too long to ignore it. Thanks in advance

thanks for the question!
First possibility is that your network name is a3-policy-network but you're pinging a network called a3-poliza-microservice - once you do get the correct ACLs in place (currently, that's the error you're trying to resolve).
The procedure for upgrade would normally be the procedure below:
After your step 12 (where you can't ping the business network due to restrictive ACL conditions, assuming you are using the right network name) you would have:
Make the changes to to include your System ACLs this time eg.
/**
* Sample access control list.
*/
rule SystemACL {
description: "System ACL to permit all access"
participant: "org.hyperledger.composer.system.Participant"
operation: ALL
resource: "org.hyperledger.composer.system.**"
action: ALLOW
}
rule NetworkAdminUser {
description: "Grant business network administrators full access to user resources"
participant: "org.hyperledger.composer.system.NetworkAdmin"
operation: ALL
resource: "**"
action: ALLOW
}
rule NetworkAdminSystem {
description: "Grant business network administrators full access to system resources"
participant: "org.hyperledger.composer.system.NetworkAdmin"
operation: ALL
resource: "org.hyperledger.composer.system.**"
action: ALLOW
}
Update the "version" field in your existing package.json in your Business Network project directory (ie need to change it next increment - eg. update the version property from 0.0.1 to 0.0.2.)
From the same directory, run the following command:
composer archive create --sourceType dir --sourceName . -a a3-policy-network#0.0.2.bna
Now install the new business network code firstly:
composer network install --card PeerAdmin#hlfv1 --archiveFile a3-policy-network#0.0.2.bna
Then perform the requisite upgrade step (single '-' for short form of the parameter):
composer network upgrade -c PeerAdmin#hlfv1 -n a3-policy-network -V 0.0.2
After a few seconds, ping the network again to see ACL changes are now in effect:
composer network ping -c a3-policy-network

How to automatically exit/stop the running instance

I have managed to create an instance and ssh into it. However, I have couple of questions regarding the Google Compute Engine.
I understand that I will be charged for the time my instance is running. That is till I exit out of the instance. Is my understanding correct?
I wish to run some batch job (java program) on my instance. How do I make my instance stop automatically after the job is complete (so that I don't get charged for the additional time it may run)
If I start the job and disconnect my PC, will the job continue to run on the instance?
Regards,
Asim

Correct, instances are charged for the time they are running. (to the minute, minimum 10 minutes). Instances run from the time they are started via the API until they are stopped via the API. It doesn't matter if any user is logged in via SSH or not. For most automated use cases users never log in - programs are installed and started via start up scripts.
You can view your running instances via the Cloud Console, to confirm if any are currently running.
If you want to stop your instance from inside the instance, the easiest way is to start the instance with the compute-rw Service Account Scope and use gcutil.
For example, to start your instance from the command line with the compute-rw scope:
$ gcutil --project=<project-id> addinstance <instance name> --service_account_scopes=compute-rw
(this is the default when manually creating an instance via the Cloud Console)
Later, after your batch job completes, you can remove the instance from inside the instance:
$ gcutil deleteinstance -f <instance name>

You can put halt command at the end of your batch script (assuming that you output your results on persistent disk).
After halt the instance will have a state of TERMINATED and you will not be charged.
See https://developers.google.com/compute/docs/pricing
scroll downn to "instance uptime"

You can auto shutdown instance after model training. Just run few extra lines of code after the model training is complete.
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
service = discovery.build('compute', 'v1', credentials=credentials)
# Project ID for this request.
project = 'xyz' # Project ID
# The name of the zone for this request.
zone = 'xyz' # Zone information
# Name of the instance resource to stop.
instance = 'xyz' # instance id
request = service.instances().stop(project=project, zone=zone, instance=instance)
response = request.execute()
add this to your model training script. When the training is complete GCP instance automatically shuts down.
More info on official website:
https://cloud.google.com/compute/docs/reference/rest/v1/instances/stop

If you want to stop the instance using the python script, you can follow this way:
from google.cloud.compute_v1.services.instances import InstancesClient
from google.oauth2 import service_account
instance_client = InstancesClient().from_service_account_file(<location-path>)
zone = <zone>
project = <project>
instance = <instance_id>
instance_client.stop(project=project, instance=instance, zone=zone)
In the above script, I have assumed you are using service-account for authentication. For documentation of libraries used you can go here:
https://googleapis.dev/python/compute/latest/compute_v1/instances.html

Frequent worker timeout

I have setup gunicorn with 3 workers, 30 worker connections and using eventlet worker class. It is set up behind Nginx. After every few requests, I see this in the logs.
[ERROR] gunicorn.error: WORKER TIMEOUT (pid:23475)
None
[INFO] gunicorn.error: Booting worker with pid: 23514
Why is this happening? How can I figure out what's going wrong?

We had the same problem using Django+nginx+gunicorn. From Gunicorn documentation we have configured the graceful-timeout that made almost no difference.
After some testings, we found the solution, the parameter to configure is: timeout (And not graceful timeout). It works like a clock..
So, Do:
1) open the gunicorn configuration file
2) set the TIMEOUT to what ever you need - the value is in seconds
NUM_WORKERS=3
TIMEOUT=120
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--name $NAME \
--workers $NUM_WORKERS \
--timeout $TIMEOUT \
--log-level=debug \
--bind=127.0.0.1:9000 \
--pid=$PIDFILE

On Google Cloud
Just add --timeout 90 to entrypoint in app.yaml
entrypoint: gunicorn -b :$PORT main:app --timeout 90

Run Gunicorn with --log-level debug.
It should give you an app stack trace.

Is this endpoint taking too many time?
Maybe you are using flask without asynchronous support, so every request will block the call. To create async support without make difficult, add the gevent worker.
With gevent, a new call will spawn a new thread, and you app will be able to receive more requests
pip install gevent
gunicon .... --worker-class gevent

The Microsoft Azure official documentation for running Flask Apps on Azure App Services (Linux App) states the use of timeout as 600
gunicorn --bind=0.0.0.0 --timeout 600 application:app
https://learn.microsoft.com/en-us/azure/app-service/configure-language-python#flask-app

WORKER TIMEOUT means your application cannot response to the request in a defined amount of time. You can set this using gunicorn timeout settings. Some application need more time to response than another.
Another thing that may affect this is choosing the worker type
The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker.
When I got the same problem as yours (I was trying to deploy my application using Docker Swarm), I've tried to increase the timeout and using another type of worker class. But all failed.
And then I suddenly realised I was limitting my resource too low for the service inside my compose file. This is the thing slowed down the application in my case
deploy:
replicas: 5
resources:
limits:
cpus: "0.1"
memory: 50M
restart_policy:
condition: on-failure
So I suggest you to check what thing slowing down your application in the first place

Could it be this?
http://docs.gunicorn.org/en/latest/settings.html#timeout
Other possibilities could be your response is taking too long or is stuck waiting.

This worked for me:
gunicorn app:app -b :8080 --timeout 120 --workers=3 --threads=3 --worker-connections=1000
If you have eventlet add:
--worker-class=eventlet
If you have gevent add:
--worker-class=gevent

I've got the same problem in Docker.
In Docker I keep trained LightGBM model + Flask serving requests. As HTTP server I used gunicorn 19.9.0. When I run my code locally on my Mac laptop everything worked just perfect, but when I ran the app in Docker my POST JSON requests were freezing for some time, then gunicorn worker had been failing with [CRITICAL] WORKER TIMEOUT exception.
I tried tons of different approaches, but the only one solved my issue was adding worker_class=gthread.
Here is my complete config:
import multiprocessing
workers = multiprocessing.cpu_count() * 2 + 1
accesslog = "-" # STDOUT
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(q)s" "%(D)s"'
bind = "0.0.0.0:5000"
keepalive = 120
timeout = 120
worker_class = "gthread"
threads = 3

I had very similar problem, I also tried using "runserver" to see if I could find anything but all I had was a message Killed
So I thought it could be resource problem, and I went ahead to give more RAM to the instance, and it worked.

You need to used an other worker type class an async one like gevent or tornado see this for more explanation :
First explantion :
You may also want to install Eventlet or Gevent if you expect that your application code may need to pause for extended periods of time during request processing
Second one :
The default synchronous workers assume that your application is resource bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. For instance, a request to the internet meets this criteria. At some point the external network will fail in such a way that clients will pile up on your servers.

If you are using GCP then you have to set workers per instance type.
Link to GCP best practices https://cloud.google.com/appengine/docs/standard/python3/runtime

timeout is a key parameter to this problem.
however it's not suit for me.
i found there is not gunicorn timeout error when i set workers=1.
when i look though my code, i found some socket connect (socket.send & socket.recv) in server init.
socket.recv will block my code and that's why it always timeout when workers>1
hope to give some ideas to the people who have some problem with me

For me, the solution was to add --timeout 90 to my entrypoint, but it wasn't working because I had TWO entrypoints defined, one in app.yaml, and another in my Dockerfile. I deleted the unused entrypoint and added --timeout 90 in the other.

For me, it was because I forgot to setup firewall rule on database server for my Django.

Frank's answer pointed me in the right direction. I have a Digital Ocean droplet accessing a managed Digital Ocean Postgresql database. All I needed to do was add my droplet to the database's "Trusted Sources".
(click on database in DO console, then click on settings. Edit Trusted Sources and select droplet name (click in editable area and it will be suggested to you)).

Check that your workers are not killed by a health check. A long request may block the health check request, and the worker gets killed by your platform because the platform thinks that the worker is unresponsive.
E.g. if you have a 25-second-long request, and a liveness check is configured to hit a different endpoint in the same service every 10 seconds, time out in 1 second, and retry 3 times, this gives 10+1*3 ~ 13 seconds, and you can see that it would trigger some times but not always.
The solution, if this is your case, is to reconfigure your liveness check (or whatever health check mechanism your platform uses) so it can wait until your typical request finishes. Or allow for more threads - something that makes sure that the health check is not blocked for long enough to trigger worker kill.
You can see that adding more workers may help with (or hide) the problem.

The easiest way that worked for me is to create a new config.py file in the same folder where your app.py exists and to put inside it the timeout and all your desired special configuration:
timeout = 999
Then just run the server while pointing to this configuration file
gunicorn -c config.py --bind 0.0.0.0:5000 wsgi:app
note that for this statement to work you need wsgi.py also in the same directory having the following
from myproject import app
if __name__ == "__main__":
app.run()
Cheers!

Apart from the gunicorn timeout settings which are already suggested, since you are using nginx in front, you can check if these 2 parameters works, proxy_connect_timeout and proxy_read_timeout which are by default 60 seconds. Can set them like this in your nginx configuration file as,
proxy_connect_timeout 120s;
proxy_read_timeout 120s;

In my case I came across this issue when sending larger(10MB) files to my server. My development server(app.run()) received them no problem but gunicorn could not handle them.
for people who come to the same problem I did. My solution was to send it in chunks like this:
ref / html example, separate large files ref
def upload_to_server():
upload_file_path = location
def read_in_chunks(file_object, chunk_size=524288):
"""Lazy function (generator) to read a file piece by piece.
Default chunk size: 1k."""
while True:
data = file_object.read(chunk_size)
if not data:
break
yield data
with open(upload_file_path, 'rb') as f:
for piece in read_in_chunks(f):
r = requests.post(
url + '/api/set-doc/stream' + '/' + server_file_name,
files={name: piece},
headers={'key': key, 'allow_all': 'true'})
my flask server:
#app.route('/api/set-doc/stream/<name>', methods=['GET', 'POST'])
def api_set_file_streamed(name):
folder = escape(name) # secure_filename(escape(name))
if 'key' in request.headers:
if request.headers['key'] != key:
return 404
else:
return 404
for fn in request.files:
file = request.files[fn]
if fn == '':
print('no file name')
flash('No selected file')
return 'fail'
if file and allowed_file(file.filename):
file_dir_path = os.path.join(app.config['UPLOAD_FOLDER'], folder)
if not os.path.exists(file_dir_path):
os.makedirs(file_dir_path)
file_path = os.path.join(file_dir_path, secure_filename(file.filename))
with open(file_path, 'ab') as f:
f.write(file.read())
return 'sucess'
return 404

in case you have changed the name of the django project you should also go to
cd /etc/systemd/system/
then
sudo nano gunicorn.service
then verify that at the end of the bind line the application name has been changed to the new application name

MySQL driver segfaulting under mod_perl - where to look for issue

I have a webapp that segfaults when the database in restarted and it tries to use the old connections. Running it under gdb --args apache -X leads to the following output:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1212868928 (LWP 16098)]
0xb7471c20 in mysql_send_query () from /usr/lib/libmysqlclient.so.15
I've checked that the drivers and database are all up to date (DBD::mysql 4.0008, MySQL 5.0.32-Debian_7etch6-log).
Annoyingly I can't reproduce this with a trivial script:
use DBI;
use Test::More tests => 2;
my $dbh = DBI->connect( "dbi:mysql:test", 'root' );
sub test_db {
my ($number) = $dbh->selectrow_array("select 1 ");
return $number;
}
is test_db, 1, "connected to db";
warn "restart db now";
getc;
is test_db, 1, "connected to db";
Which gives the following:
ok 1 - connected to db
restart db now at dbd-mysql-test.pl line 23.
DBD::mysql::db selectrow_array failed: MySQL server has gone away at dbd-mysql-test.pl line 17.
not ok 2 - connected to db
# Failed test 'connected to db'
# at dbd-mysql-test.pl line 26.
# got: undef
# expected: '1'
This behaves correctly, telling me why the request failed.
What stumps me is that it is segfaulting, which it shouldn't do. As it only appears to happen when the whole app is running (which uses DBIx::Class) it is hard to reduce it to a test case.
Where should I start to look to debug this? Has anyone else seen this?
UPDATE: further prodding showed that it being under mod_perl was a red herring. Having reduced it to a simple test script I've now posted to the DBI mailing list. Thanks for your answers.

What this probably means is that there's a difference between your mod_perl environment and the one you were testing via your script. Some things to check:
Was your mod_perl compiled with the same version of Perl
Are the #INC's the same for both
Are you using threads in your mod_perl setup? I don't believe DBD::mysql is completely thread-safe.

I've seen this problem, but I'm not sure it had the same cause as yours. Are you by chance using a certain module for sending mails (forgot the name, sorry) from your application? When we had the problem in a project, after days of debugging we found that this mail module was doing strange things with open file descriptors, then forked off another process which called the console tool sendmail, which again did strange things with file descriptors. I guess one of the file descriptors it messed around with was the connection to the database, but I'm still not sure about that. The problem disappeared when we switched to another module for sending mails. Maybe it's worth a look for you too.

If you're getting a segfault, do you have a core file greated? If not, check ulimit -c. If that returns 0, your system won't create core files and you'll have to change that. If you do have a core file, you can use gdb or similar tools to debug it. It's not particularly fun, but it's possible. The start of the command will look something like:
gbd /usr/bin/httpd core
There are plenty of tutorials for debugging core files scattered about the Web.
Update: Just found a reference for ensuring you get core dumps from mod_perl. That should help.

This is a known problem in old DBD::mysql. Upgrade it (4.008 is not up to date).
There's a simple test script attached to https://rt.cpan.org/Public/Bug/Display.html?id=37027
that will trigger this bug.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008