Is there a way to get an alert if pm2 auto restarts a process because of max memory usage? - pm2

I would like to set up an alert in my monitoring system to keep track of memory-based restarts. I'm using the flag described here: https://pm2.io/docs/runtime/features/memory-limit/
"max_memory_restart": '3296M',
Ideally I would like to be able to call a shell script or run a Node program before the auto restart happens.
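pm2 doesn't document a pre-restart hook for the memory limit, but one workaround is to poll the process's memory yourself and fire an alert before pm2's 3296M threshold is reached. A minimal sketch; the warning threshold of 3000M, the process name "app", and the alert.sh hook are all assumptions, not pm2 features:

```shell
#!/bin/sh
# Warn before pm2's max_memory_restart (3296M) kicks in.
# WARN_MB and the alert command below are assumptions, not pm2 features.
WARN_MB=3000

# True (exit 0) when the given RSS in MB is at or above the threshold.
mem_exceeds() {
    rss_mb=$1
    [ "$rss_mb" -ge "$WARN_MB" ]
}

# Example polling loop (requires a running pm2 process named "app"):
# while true; do
#     pid=$(pm2 pid app)
#     rss_kb=$(ps -o rss= -p "$pid")
#     if mem_exceeds $((rss_kb / 1024)); then
#         ./alert.sh "app is near its memory limit"   # your alert hook
#     fi
#     sleep 30
# done
```

Alternatively, your monitoring system can scrape `pm2 jlist` (pm2's JSON status output) and watch each process's restart counter, which gives you after-the-fact alerts on every restart rather than a pre-restart warning.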

Related

How to track disk usage on Container-Optimized OS

I have an application running on Container-Optimized OS based Compute Engine.
My application runs every 20min, fetches and writes data to a local file, then deletes the file after some processing. Note that each file is less than 100KB.
My boot disk size is the default 10GB.
I run into "no space left on device" error every month or so while attempting to write the file locally.
How can I track disk usage?
I manually checked the size of the folders and it seems that the bulk of the space is taken by /mnt/stateful_partition/var/lib/docker/overlay2.
my-vm / # sudo du -sh /mnt/stateful_partition/var/lib/docker/*
20K /mnt/stateful_partition/var/lib/docker/builder
72K /mnt/stateful_partition/var/lib/docker/buildkit
208K /mnt/stateful_partition/var/lib/docker/containers
4.4M /mnt/stateful_partition/var/lib/docker/image
52K /mnt/stateful_partition/var/lib/docker/network
1.6G /mnt/stateful_partition/var/lib/docker/overlay2
20K /mnt/stateful_partition/var/lib/docker/plugins
4.0K /mnt/stateful_partition/var/lib/docker/runtimes
4.0K /mnt/stateful_partition/var/lib/docker/swarm
4.0K /mnt/stateful_partition/var/lib/docker/tmp
4.0K /mnt/stateful_partition/var/lib/docker/trust
28K /mnt/stateful_partition/var/lib/docker/volumes
TL;DR: Use Stackdriver Monitoring and create an alert for DISK usage.
Since you are using COS images, you can enable the Stackdriver Monitoring agent simply by setting the google-monitoring-enabled metadata key to true on the GCE instance. To do so, run the command:
gcloud compute instances add-metadata instance-name --metadata=google-monitoring-enabled=true
Replace instance-name with the name of your instance. Remember to restart your instance for the change to take effect. You don't need to install the Stackdriver Monitoring agent, since it is already installed by default in COS images.
Then, you can use the disk usage metric to get the usage of your disk.
You can create an alert to get a notification each time the usage of the partition reaches a certain threshold.
Since you are running in the cloud, it is usually best to use cloud resources to solve cloud issues.
Docker uses /var/lib/docker to store your images, containers, and local named volumes. Deleting this can result in data loss and possibly stop the engine from running. The overlay2 subdirectory specifically contains the various filesystem layers for images and containers.
To clean up unused containers and images, run:
docker system prune
You can monitor the directory with watch:
sudo watch "du -sh /mnt/stateful_partition/var/lib/docker/*"
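As a local complement to the Stackdriver alert, a small cron-able check can warn when the filesystem holding /var/lib/docker crosses a threshold. A sketch; the 80% threshold and the prune suggestion in the message are assumptions:

```shell
#!/bin/sh
# Alert when the filesystem containing Docker's data directory is
# nearly full. THRESHOLD is an assumed value; tune it to taste.
THRESHOLD=80

# Print the Use% (as a bare number) of the filesystem holding $1.
usage_pct() {
    df -P "$1" 2>/dev/null | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

target=/var/lib/docker
[ -d "$target" ] || target=/   # fall back to the root filesystem

pct=$(usage_pct "$target")
if [ "$pct" -ge "$THRESHOLD" ]; then
    echo "disk at ${pct}% - consider running: docker system prune"
fi
```

Dropping this into a cron entry gives you a cheap second line of defence even if the Stackdriver alert is misconfigured.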

Google compute engine, instance dead? How to reach?

I have a small instance running in GCE; I had some trouble with MongoDB, so after some tries I decided to reset the instance. But... it didn't seem to come back online. So I stopped the instance and restarted it.
It is a Bitnami MEAN stack which starts Apache and other services at startup.
But... I can't reach the instance! No SCP, no SSH, no webservice running. When I try to connect via SSH (in GCE) it times out; I can't make a connection on port 22. In the information it says 'The instance is booting up and sshd is not running yet', which is possible of course... But I can't reach the instance in any manner, not even after an hour's wait :) Not sure what's happening if I can't connect to it somehow :(
There is some activity in the console... some CPU usage, mostly 0%, some incoming traffic but no outgoing...
I hope someone can give me a hint here!
Update 1
After the helpful tip from Serhii... I found this in the logs...
Booting from Hard Disk 0...
[ 0.872447] piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: Inodes that were part of a corrupted orphan linked list found.
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
fsck exited with status code 4
The root filesystem on /dev/sda1 requires a manual fsck
Update 2...
So, I need to fsck the drive...
Created a snapshot, made a new disk from that snapshot, and added the new disk as an extra disk to another instance. Now that instance won't boot, with the same problem... removing the extra disk fixed it again. So adding the disk makes it crash even though it isn't the boot disk?
First, have a look at Compute Engine -> VM instances -> NAME_OF_YOUR_VM -> Logs -> Serial port 1 (console) and try to find errors and warnings that could be connected to a lack of free space or SSH. It would be helpful if you updated your post with this information. If your instance ran out of free space, follow these instructions.
You can try to connect to your VM via Serial console by following this guide, but keep in mind that:
The interactive serial console does not support IP-based access
restrictions such as IP whitelists. If you enable the interactive
serial console on an instance, clients can attempt to connect to that
instance from any IP address.
You can find more details in the documentation.
Have a look at the Troubleshooting SSH guide and Known issues for SSH in browser. In addition, Google provides a troubleshooting script for Compute Engine to identify issues with SSH login/accessibility of your Linux based instance.
If you still have a problem try to use your disk on a new instance.
EDIT It looks like your test VM is trying to boot from the disk that you created from the snapshot. Try to follow this guide.
If you still have a problem, you can try to recreate the boot disk from a snapshot to resize it.
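For reference, the rescue-disk route for update 2 can be sketched as below: attach the broken disk to a healthy VM as a secondary (non-boot) disk and fsck it there. All instance and disk names, the zone, and the /dev/sdb1 device path are placeholders, and DRY_RUN (on by default here) only prints the commands instead of running them:

```shell
#!/bin/sh
# Rescue-disk sketch. Names, zone and device path are placeholders;
# with DRY_RUN=1 (the default) the commands are only printed.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" -eq 1 ]; then echo "+ $*"; else "$@"; fi; }

run gcloud compute instances stop broken-vm --zone us-central1-a
run gcloud compute instances detach-disk broken-vm --disk broken-disk --zone us-central1-a
run gcloud compute instances attach-disk rescue-vm --disk broken-disk --zone us-central1-a
# On rescue-vm the disk shows up as a non-boot device (e.g. /dev/sdb):
run sudo fsck.ext4 -fy /dev/sdb1
```

Verify the actual device name with lsblk on the rescue VM before running fsck; pointing fsck at the wrong device can destroy data.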

gcloud compute: issue command and close terminal

Since it takes time to create a snapshot of a Google Compute Engine instance, I wonder whether it is possible to issue the gcloud compute disks snapshot command on my local machine and then close the terminal without interrupting the snapshot creation process.
From the documentation for gcloud compute disks snapshot:
FLAGS
--async
Display information about the operation in progress, without waiting for the operation to complete.
You can run gcloud compute disks snapshot --async and note the operation ID, then run gcloud compute operations describe <OPERATION ID> to check on the operation (you may also have to provide the zone of the operation, which should be the same as the zone of the disk).
Even if you don't use the --async flag, the operation is running asynchronously in the background (gcloud is just staying open until it finishes). If you close the terminal, the snapshot will finish. You'd just need to do some digging to find the operation ID if you're interested in following up on its status.
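Putting those two commands together, the fire-and-poll pattern looks roughly like this. The disk name, zone, and --format expressions are placeholders, not verified against your setup:

```shell
#!/bin/sh
# Fire the snapshot asynchronously, then poll the operation until it
# reports DONE. Disk name and zone below are placeholders.

# Poll: "$@" is any command that prints the operation's status.
wait_done() {
    while :; do
        status=$("$@")
        [ "$status" = "DONE" ] && break
        sleep 5
    done
}

# Usage (placeholders; safe to close the terminal after the first line
# and run wait_done later, in a fresh session):
# op=$(gcloud compute disks snapshot my-disk --zone us-central1-a \
#         --async --format='value(name)')
# wait_done gcloud compute operations describe "$op" \
#         --zone us-central1-a --format='value(status)'
```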

Why are shutdown scripts not guaranteed?

In the documentation it says that shutdown scripts are only run on a best effort basis and that they are not guaranteed to run. I'm wondering what conditions these would be where they wouldn't run?
Edit
As AndyJ pointed out, the documentation I linked to describes when the script is supposed to be run. To clarify, I have read all of that, but it seemed to me that the lack of a guarantee to run included the conditions in which it normally is supposed to run.
So, to better phrase my question, is the script guaranteed to run in the conditions the documentation says it runs it, or is that only when it should and not necessarily when it will?
They address this in the documentation you linked to.
Shutdown script invocation
Shutdown scripts execute when an instance is scheduled to restart or terminate. There are many ways to restart or terminate an instance but only some actions will trigger the shutdown script to run. A shutdown script will run when:
An instance is deleted through an instances().delete request. This includes any tools or scripts that use the API, such as the Google Cloud Platform Console and gcloud compute.
An instance is shut down through the console or the instances.stop() method.
An instance is restarted or shut down through a request to the guest operating system, such as a sudo shutdown or sudo reboot.
Note: If your shutdown script requires a network connection, we recommend shutting down your instance using this method because of a known issue with network connectivity loss. The issue primarily affects instances that have been shut down outside of the guest operating system.
The shutdown script will not run if the instance is reset using instances().reset.
Shutdown script running time
When a shutdown script is invoked, it has a limited time period to run, between when the request is made to shut down or restart the instance, to when the instance is actually terminated. During this period, Compute Engine will attempt to run your shutdown script but if the script takes longer than this time period to complete, the instance will automatically terminate and all running tasks are killed. If you shut down or restart an instance by making a request to the guest OS (e.g. running sudo shutdown) the limit does not apply.
In general, we recommend that your shutdown script finishes running within this time period, so that the operating system has time to complete its shutdown, including flushing buffers to disk.
For more information on this time limit, see Shutdown period.
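Given that time limit, a defensive pattern is to do the critical work first and bound each step, so the script can never overrun the window. A minimal sketch; the 20-second cap is an assumption, adjust it to your machine type's shutdown period:

```shell
#!/bin/sh
# Shutdown-script sketch: log with timestamps and cap each critical
# step with a timeout so the whole script fits the shutdown window.

log() { echo "$(date -u +%FT%TZ) $*"; }

log "shutdown script started"
timeout 20 sync          # flush buffers first; never let it overrun
log "buffers flushed"
# drain queues, upload logs, etc. - each step bounded the same way
```

Ordering steps by importance means that even if the window closes early, the most critical work has already happened.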

How can I guarantee the w3wp process exists before turning on perfmon logging?

I have a batch script I run before our performance tests that does some pre-test setup on our server; it clears log files, starts the proper services, restores the database, sets some app settings and turns on perfmon logging.
My problem: the w3wp process we need to monitor is not always present at the time we turn on perfmon logging. It's pretty much hit-or-miss whether this process is in the log. The test takes anywhere from 4 to 18 hours to complete, and I don't know until the test is done whether or not w3wp was monitored (it doesn't seem that perfmon detects new processes even though my log file is configured to monitor Process(*)), which ends up wasting a lot of time.
Is there a way to force w3wp to get loaded? Is there some command I can call just prior to starting the perfmon logs?
Or, is it possible to configure the perfmon log to monitor processes that may not exist at the time the log is started?
If you install the IIS Admin tools, you can call a command-line app called TinyGet. You can pass it any page on your web server to initialize it. This will start up the process so you can capture it.
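If TinyGet isn't available, any HTTP request serves the same purpose of forcing IIS to spawn w3wp before perfmon starts. A sketch of a retrying warm-up using curl; the URL and retry counts are placeholders, and on Windows you could do the equivalent with PowerShell's Invoke-WebRequest from your batch script:

```shell
#!/bin/sh
# Hit the site until it answers (or we give up), so the worker process
# exists before perfmon logging starts. URL/retries are placeholders.
warm_up() {
    url=$1
    tries=0
    until curl -fsS -o /dev/null "$url" || [ "$tries" -ge 5 ]; do
        tries=$((tries + 1))
        sleep 2
    done
}

# warm_up "http://localhost/myapp/health"   # then start perfmon
```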