How can I make my qpid queues persistent? - qpid

We have recently installed version 1.39.0 of the qpid C++ broker on a CentOS 7 server. The following RPMs have been installed:
pmena#server=> rpm -qa | grep qpid
qpid-proton-c-0.33.0-1.el7.x86_64
qpid-cpp-client-1.39.0-1.el7.x86_64
qpid-tests-1.37.0-5.el7.noarch
python2-qpid-1.37.0-5.el7.noarch
qpid-qmf-1.39.0-1.el7.x86_64
qpid-tools-1.39.0-1.el7.noarch
qpid-cpp-server-1.39.0-1.el7.x86_64
python2-qpid-qmf-1.39.0-1.el7.x86_64
qpid-cpp-client-docs-1.39.0-1.el7.noarch
We can add queues with the durable attribute, but after stopping and restarting qpidd, the queues disappear. When restoring the queues via the qpid-config add queue command, any statistical information associated with the queue is lost. Why are the queues - and their associated statistics - not persisting between restarts?

It seems that adding the qpid-cpp-server-linearstore-1.39.0-1.el7.x86_64 package from the CentOS repo gave us the functionality we were seeking. We were able to test this by creating some test traffic, observing the increased queue message and byte counts, and then restarting qpid. The queue message and byte counts were intact.
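For anyone hitting the same problem, a minimal sketch of what we did (the queue name below is just an example; commands assume a stock CentOS 7 install):
sudo yum install qpid-cpp-server-linearstore
sudo systemctl restart qpidd
qpid-config add queue test-durable-queue --durable
# generate some test traffic, then bounce the broker
sudo systemctl restart qpidd
qpid-config queues | grep test-durable-queue   # the queue and its counts survive the restart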

Related

Cloud Run sending SIGTERM with no visible scale down on container instances

I've deployed a Python FastAPI application on Cloud Run using Gunicorn + Uvicorn workers.
Cloud Run configuration:
Dockerfile
FROM python:3.8-slim
# Allow statements and log messages to immediately appear in the Knative logs
ENV PYTHONUNBUFFERED True
ENV PORT ${PORT}
ENV APP_HOME /app
ENV APP_MODULE myapp.main:app
ENV TIMEOUT 0
ENV WORKERS 4
WORKDIR $APP_HOME
COPY ./requirements.txt ./
# Install production dependencies.
RUN pip install --no-cache-dir --upgrade -r /app/requirements.txt
# Copy local code to the container image.
COPY . ./
# Run the web service on container startup. Here we use the gunicorn
# webserver with uvicorn worker processes; WORKERS is set to 4 above.
# For environments with multiple CPU cores, increase the number of workers
# to match the cores available.
# Timeout is set to 0 to disable the timeouts of the workers to allow Cloud Run to handle instance scaling.
CMD exec gunicorn --bind :$PORT --workers $WORKERS --worker-class uvicorn.workers.UvicornWorker --timeout $TIMEOUT $APP_MODULE --preload
My application receives a request and does the following:
Makes an async call to Cloud Firestore using firestore.AsyncClient
Runs an algorithm using Google OR-Tools. I've used cProfile to check that this task on average takes < 500 ms to complete.
Adds a FastAPI async background task to write to BigQuery. This is achieved as follows:
from fastapi.concurrency import run_in_threadpool
async def bg_task():
    # create json payload
    errors = await run_in_threadpool(lambda: client.insert_rows_json(table_id, rows_to_insert))  # Make an API request.
I have been noticing intermittent Handling signal: term logs, which cause Gunicorn to shut down its processes and restart them. I can't get my head around why this might be happening, and the surprising bit is that it sometimes happens at off-peak hours when the API is receiving 0 requests. There doesn't seem to be any apparent scaling down of Cloud Run instances causing this issue either.
The issue is that this also happens quite frequently under production load during peak hours - and it even causes Cloud Run to autoscale from 2 to 3/4 instances, which adds cold start times to my API. My API receives on average 1 request/minute.
Cloud Run metrics during random SIGTERM
As clearly shown here, my API has not been receiving any requests in this period and Cloud Run has no business killing and restarting Gunicorn processes.
Another startling issue is that this seems to only happen in my production environment. In my development environment, I have the exact SAME setup but I don't see any of these issues there.
Why is Cloud Run sending SIGTERM and how do I avoid it?
Cloud Run is a serverless platform, which means server management is done by Google Cloud, and it can choose to stop some instances from time to time (for maintenance reasons, for technical issues, ...).
But this changes nothing for you: there is of course a cold start, but it should be invisible to your process, even under high load, because you have the min-instances parameter set to 2, which keeps instances up and ready to serve traffic without a cold start.
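For reference, the minimum instance count can be set (or confirmed) with gcloud; the service name and region below are placeholders:
# keep at least two instances warm to avoid cold starts
gcloud run services update my-service --region=europe-west1 --min-instances=2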
Can you have 3 or 4 instances in parallel instead of 2 (the minimum)? Yes, but the billable instance count stays flat at 2. Cloud Run, again, is serverless; it can create backup instances to make sure that a future shutdown of some of them won't impact your traffic. It's an internal optimization, at no additional cost - it just works.
Can you avoid that? No, because it's serverless, and also because there is no impact on your workloads.
Last point about "environment": for Google Cloud, all projects are production projects. There is no difference; Google can't know what is critical or not, therefore everything is treated as critical.
If you notice a difference between two projects, it's simply because your projects are deployed on different Google Cloud internal clusters. The status, performance, and maintenance operations (...) differ between clusters. And again, you can't do anything about that.

GCE Windows Server gets auto shut down

My Windows Server instance on GCE is shut down from time to time. Based on the GCP logging, we can tell that failing the lateBootReportEvent check only triggers a reboot some of the time. I am wondering why?
logs screenshot
I am aware that auto-shutdown is caused by integrity monitoring (settings shown below), and I understand that my boot integrity might fail here. I am just trying to understand why there is a "probability" here.
Shielded-VM settings
The integrity monitor and Shielded VMs don't have any relation to a VM restart or shutdown.
Integrity monitoring only compares the most recent boot measurements to the integrity policy baseline and returns a pair of pass/fail results depending on whether they match or not, one for the early boot sequence and one for the late boot sequence.
Early boot is the boot sequence from the start of the UEFI firmware until it passes control to the bootloader. Late boot is the boot sequence from the bootloader until it passes control to the operating system kernel. If either part of the most recent boot sequence doesn't match the baseline, you get an integrity validation failure.
If the failure is expected, for example if you applied some system update on that VM instance, you should update the integrity policy baseline. If it is not expected, you should stop that VM instance and investigate the reason for the failure - but the VM will never be shut down by the integrity monitor.
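If the change was expected (for example, after an OS update), the baseline can be re-learned from the most recent boot measurements; a minimal sketch with placeholder instance and zone names:
# update the Shielded VM integrity policy baseline for this instance
gcloud compute instances update my-windows-vm --zone=us-central1-a --shielded-learn-integrity-policy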
In order to determine what actually caused the VM to restart, you will need to look at the internal Windows Event Viewer logs: review the logs for the instance at the time of the shutdown, then reference the shutdown reason against Microsoft's reason codes to determine what caused the VM to stop.
It is possible that the instance restarted to complete installation of updates, or encountered an internal error. However only the event viewer logs will determine the true cause.
If you find any useful internal logs, please share them on this post so we can check.

Laravel 5.4 queue:restart on windows?

I am learning the Laravel 5.4 "queues" chapter and have a question about the queue:restart command. When I test it on my Windows 10 platform, this command seems to just kill the queue worker, not restart it. So I wonder: does this command not work on Windows, or does it only kill the worker rather than restarting it? Thanks.
The queue:restart command never actually restarts a worker; it just tells it to shut down. It is supposed to be combined with a process manager like Supervisor, which will restart the process when it quits. The same thing happens when queue:work hits the configured memory limits.
To keep the queue:work process running permanently in the background, you should use a process monitor such as Supervisor to ensure that the queue worker does not stop running.
Source: https://laravel.com/docs/5.4/queues#running-the-queue-worker
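As a quick illustration of that behaviour (commands only; a process manager such as Supervisor is still needed to respawn the worker):
php artisan queue:work      # terminal 1: the worker runs until it is told to stop
php artisan queue:restart   # terminal 2: the worker above finishes its current job and exits; nothing relaunches it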

EC2 Instance is running very slow

I am running an EC2 instance with Ubuntu Server. Tomcat and MySQL are installed, and a Java web application has been deployed on it for about a month. It ran with great performance for almost a month, but now my application is responding very slowly.
Also, a point to note: earlier, when I logged into my Ubuntu server through PuTTY, it was quick, but now it takes a long time even after I enter the Ubuntu password.
Is there any solution?
I would start by checking memory/CPU/network availability to see whether one of them is the bottleneck.
Try following commands:
To check memory availability:
free -m
To check CPU usage:
top
To check network usage:
ntop
To check disk usage:
df -h
To check disk io operations:
iotop
Please also check whether, with your application disabled, you are able to quickly log in to that machine. If login is still slow, then you should contact AWS support, complaining about poor performance and asking for more resources to be assigned to that machine.
You can use the WAIT tool to diagnose what is wrong with your server or your application. The tool will gather all information about CPU and memory utilization, running threads, etc.
In addition, I would definitely check the Tomcat application server with VisualVM or some other profiler. For configuring JMX for Tomcat, you can check the article here.
For network monitoring, the nload tool is worth your attention. You can launch it in a screen session so you can always check network utilization stats when the server is slow.
First, check whether any application is using too much CPU or memory. This can be checked with the top command. Two simple shortcut keys may be helpful while using top: in the top result page, pressing M sorts processes by memory usage, from highest to lowest, and pressing P sorts them by CPU usage, from highest to lowest.
If you are unable to find any suspicious application using top, you can use iotop, which will show disk I/O usage details.
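A rough sketch of those checks from the command line (flags shown are common ones; adjust as needed):
top -b -n 1 | head -20   # one-shot snapshot of the busiest processes
sudo iotop -o            # -o: only show processes currently doing disk I/O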
I was facing the same issue; the solution that worked for me was to restart the EC2 instance.
Edit:
Lately, I figured out this issue was happening due to the limited resources (memory, CPU) available to the EC2 machine, so check the resources available to your EC2 machine.

Integrate different Nagios webservers

I have different sites, with 4 to 5 servers at each location. Each location has one monitoring server running Nagios. Now I want to create a central location and combine all the Nagios services running at each location. Can anyone please point me to some documentation for this type of job?
There are two approaches that you can take.
Install a new Nagios core as you did at each location and perform active checks on each of the remote hosts (see the sketch below). You'll likely end up installing NRPE on each of the remote hosts at each location and can read this document for the details: http://nagios.sourceforge.net/docs/nrpe/NRPE.pdf. If your remote servers are Windows servers, you can use NSClient to do much of the same things that NRPE does for Linux hosts. This effectively centralizes your monitoring server. I also wrote some how-to style entries for using NRPE to run privileged commands http://blog.gnucom.cc/?p=479 or to run event handlers http://blog.gnucom.cc/?p=458. If you get tired of installing NRPE, you can use my script here http://blog.gnucom.cc/?p=185. I also have instructions to install NSClient here http://blog.gnucom.cc/?p=201.
Install a new Nagios core as you did at each location and perform passive checks by instructing the remote Nagios cores to feed their results to the new central Nagios core's passive command file. I haven't done this myself, so I'm going to point you to the community's documentation here: http://nagios.sourceforge.net/docs/2_0/passivechecks.html. You could probably look at my event handler post to set up event handlers that send checks to the main server.
From my personal experience, the first option I mentioned is easier to implement and far easier to administer. However, as your server fleet grows you'll start seeing major CPU bottlenecks with the main Nagios core. This is where passive checks become beneficial, as the main Nagios core simply waits for critical checks to be sent to it rather than having to run them all itself.
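For the first (active-check) approach, a minimal sketch of what a check issued from the central server looks like; the host and command names are placeholders, and the command must be defined in the remote host's nrpe.cfg:
/usr/local/nagios/libexec/check_nrpe -H remote-host.example.com -c check_load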
Hope this helps. :)
A centralized view tool may be what you are looking for. There are a number of different options available.
Nagiosfusion
MK Livestatus
Nagcen
Thruk