Cannot connect to Compute Engine VM - No space left on device

I've run into an issue where I can no longer connect to a VM that I could connect to before.
Checking the VM's Observability tab as well as the server's serial port output shows:
Aug 2 09:27:14 my-server systemd[1]: systemd-logind.service: Failed to run 'start' task: No space left on device
Aug 2 09:27:14 my-server systemd[1]: systemd-logind.service: Failed with result 'resources'.
Aug 2 09:27:14 my-server systemd[1]: Failed to start Login Service.
Aug 2 09:27:14 my-server systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 1.
So the likely culprit is simply that the disk is full.
I found this troubleshooting guide, which I followed by:
Shutting down the VM.
Increasing the boot disk size from 1000 to 1400, and to 1450 in a second attempt.
Starting the VM.
Unfortunately that didn't help and I still can't connect. Further steps require me to connect via SSH. It might be worth noting that I did attempt to increase the size through the web interface first, before I found the troubleshooting page; at that point the VM might or might not have been shut down properly. Another change I noticed is that the VM no longer reports Disk Space Utilization or Memory Utilization in the Observability tab since it was restarted, though it still reports other stats.
Under "File system issues" it is mentioned that on Debian images a line like ... expand-root.sh[..]: Resizing ext4 filesystem on /dev/sda1. should appear. While this is an Ubuntu image, I couldn't find anything similar using grep.
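For reference, the serial port output can also be searched from outside the VM (a sketch; it assumes the gcloud CLI is set up, and the zone is a placeholder):
gcloud compute instances get-serial-port-output my-server --zone=ZONE | grep -i "resizing"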
So, how could I gain access to the stuff on the VM?

I would suggest the following:
Create a snapshot of the affected disk.
Create an additional disk from that snapshot.
Attach the new disk to another working Linux VM instance ("instance-1", for example) as an additional disk, not as the boot disk.
Mount the attached disk to a directory.
After mounting, free up some space on the disk, so the services needed to recognize the resize have room to run.
After freeing up some space, re-attach the disk as the boot disk of the original VM to test whether you have regained control of the instance.
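A rough gcloud sketch of those steps (the disk, snapshot, zone, and mount-point names are placeholders, and the device name of the attached disk may differ on your VM):
# 1. Snapshot the affected boot disk
gcloud compute disks snapshot affected-disk --zone=ZONE --snapshot-names=recovery-snapshot
# 2. Create a new disk from the snapshot
gcloud compute disks create recovery-disk --zone=ZONE --source-snapshot=recovery-snapshot
# 3. Attach it to a healthy helper VM as a secondary (non-boot) disk
gcloud compute instances attach-disk instance-1 --zone=ZONE --disk=recovery-disk
# 4. On instance-1: find and mount the partition, then delete files to free space
lsblk                                   # identify the device, e.g. /dev/sdb1
sudo mkdir -p /mnt/recovery
sudo mount /dev/sdb1 /mnt/recovery
sudo du -xh --max-depth=2 /mnt/recovery | sort -h | tail    # largest directories
# ...delete what you can, then unmount and detach before re-attaching as boot disk
sudo umount /mnt/recovery
gcloud compute instances detach-disk instance-1 --zone=ZONE --disk=recovery-disk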

Related

How to track disk usage on Container-Optimized OS

I have an application running on a Container-Optimized OS based Compute Engine instance.
My application runs every 20min, fetches and writes data to a local file, then deletes the file after some processing. Note that each file is less than 100KB.
My boot disk size is the default 10GB.
I run into a "no space left on device" error every month or so while attempting to write the file locally.
How can I track disk usage?
I manually checked the size of the folders and it seems that the bulk of the space is taken by /mnt/stateful_partition/var/lib/docker/overlay2.
my-vm / # sudo du -sh /mnt/stateful_partition/var/lib/docker/*
20K /mnt/stateful_partition/var/lib/docker/builder
72K /mnt/stateful_partition/var/lib/docker/buildkit
208K /mnt/stateful_partition/var/lib/docker/containers
4.4M /mnt/stateful_partition/var/lib/docker/image
52K /mnt/stateful_partition/var/lib/docker/network
1.6G /mnt/stateful_partition/var/lib/docker/overlay2
20K /mnt/stateful_partition/var/lib/docker/plugins
4.0K /mnt/stateful_partition/var/lib/docker/runtimes
4.0K /mnt/stateful_partition/var/lib/docker/swarm
4.0K /mnt/stateful_partition/var/lib/docker/tmp
4.0K /mnt/stateful_partition/var/lib/docker/trust
28K /mnt/stateful_partition/var/lib/docker/volumes
TL;DR: Use Stackdriver Monitoring and create an alert for disk usage.
Since you are using COS images, you can enable the Stackdriver Monitoring agent simply by setting the google-monitoring-enabled metadata key to true on the GCE instance. To do so, run the command:
gcloud compute instances add-metadata instance-name --metadata=google-monitoring-enabled=true
Replace instance-name with the name of your instance. Remember to restart the instance for the change to take effect. You don't need to install the Stackdriver Monitoring agent, since it is already installed by default on COS images.
Then you can use the disk usage metric to see how full your disk is.
You can create an alert to get a notification each time the usage of the partition reaches a certain threshold.
Since you are in the cloud, it is usually best to use cloud resources to solve cloud issues.
Docker uses /var/lib/docker to store your images, containers, and local named volumes. Deleting this can result in data loss and possibly stop the engine from running. The overlay2 subdirectory specifically contains the various filesystem layers for images and containers.
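To see which images, containers, and volumes account for that space, Docker's built-in usage report can help (a sketch; docker system df is available in Docker 1.13 and later):
docker system df -v    # per-image, per-container and per-volume disk usage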
To clean up unused containers and images, run:
docker system prune
Monitor it with the watch command:
sudo watch "du -sh /mnt/stateful_partition/var/lib/docker/*"

Google compute engine, instance dead? How to reach?

I have a small instance running in GCE. I had some trouble with MongoDB, so after some tries I decided to reset the instance. But... it didn't seem to come back online, so I stopped the instance and restarted it.
It is a Bitnami MEAN stack which starts Apache and other services at startup.
But... I can't reach the instance! No SCP, no SSH, no web service running. When I try to connect via SSH (in GCE) it times out; it can't make a connection on port 22. In the information it says 'The instance is booting up and sshd is not running yet', which is possible of course... But I can't reach the instance in any manner, not even after an hour's wait :) Not sure what's happening if I can't connect to it somehow :(
There is some activity in the console... some CPU usage, mostly 0%, some incoming traffic but no outgoing...
I hope someone can give me a hint here!
Update 1
After the helpful tip from Serhii... I found this in the logs...
Booting from Hard Disk 0...
[ 0.872447] piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1: Inodes that were part of a corrupted orphan linked list found.
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
fsck exited with status code 4
The root filesystem on /dev/sda1 requires a manual fsck
Update 2...
So, I need to fsck the drive...
I created a snapshot, made a new disk from that snapshot, and added the new disk as an extra disk to another instance. Now that instance won't boot, with the same problem... removing the extra disk fixed it again. So adding the disk makes it crash even though it isn't the boot disk?
First, have a look at Compute Engine -> VM instances -> NAME_OF_YOUR_VM -> Logs -> Serial port 1 (console) and try to find errors and warnings that could be related to a lack of free space or SSH. It would be helpful if you updated your post with this information. In case your instance ran out of free space, follow these instructions.
You can try to connect to your VM via Serial console by following this guide, but keep in mind that:
The interactive serial console does not support IP-based access restrictions such as IP whitelists. If you enable the interactive serial console on an instance, clients can attempt to connect to that instance from any IP address.
You can find more details in the documentation.
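If SSH is unavailable, a rough sketch of enabling and opening the interactive serial console with gcloud (the instance name and zone are placeholders):
gcloud compute instances add-metadata NAME_OF_YOUR_VM --zone=ZONE --metadata=serial-port-enable=TRUE
gcloud compute connect-to-serial-port NAME_OF_YOUR_VM --zone=ZONE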
Have a look at the Troubleshooting SSH guide and Known issues for SSH in browser. In addition, Google provides a troubleshooting script for Compute Engine to identify issues with SSH login/accessibility of your Linux-based instance.
If you still have a problem, try to use your disk on a new instance.
EDIT: It looks like your test VM is trying to boot from the disk that you created from the snapshot. Try to follow this guide.
If you still have a problem, you can try to recreate the boot disk from a snapshot to resize it.
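Regarding Update 2: once the disk created from the snapshot is attached to a helper instance as a secondary (non-boot) disk, a rough sketch of the manual fsck the boot log asks for (the device name /dev/sdb1 is an assumption; check lsblk first and make sure the filesystem is not mounted):
lsblk                          # confirm which device the attached disk shows up as
sudo fsck.ext4 -fy /dev/sdb1   # force a check and repair errors automatically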

mysql docker container crashes often

I am using MariaDB and WordPress containers, but this error keeps happening. How can I ensure that this crash does not happen anymore? Am I being attacked? Or is it a problem that happens to other people? How can I attach to the MariaDB container, get access to its shell, and try to find out what is going on inside it?
See below the messages logged after every crash. There seems to be a high number of page hits as well; page visits go up to 20,000 to 60,000 hits on pages. These seem to be the work of crawlers and bots; I'm not sure if they are malicious attacks.
Any help on how to go about dealing with this problem?
I have MariaDB, WordPress and phpMyAdmin running in three Docker containers under Ubuntu 14 on DigitalOcean.
Here are the crash messages:
[1668002.926214] Out of memory: Kill process 16765 (mysqld) score 176 or sacrifice child
[1668002.935614] Killed process 16765 (mysqld) total-vm:1012836kB, anon-rss:178840kB, file-rss:0kB
[1668040.992415] Killed process 22570 (php5-fpm) total-vm:418044kB, anon-rss:145392kB, file-rss:20624kB
New York server:
[1225007.977126] Out of memory: Kill process 3161 (mysqld) score 245 or sacrifice child
[1225007.985657] Killed process 3161 (mysqld) total-vm:977148kB, anon-rss:122488kB, file-rss:0kB
Frankfurt server:
[1632264.057873] Out of memory: Kill process 22421 (mysqld) score 246 or sacrifice child
[1632264.067530] Killed process 22421 (mysqld) total-vm:1005228kB, anon-rss:249328kB, file-rss:0kB
The official MySQL images on Docker Hub use a configuration that's recommended by MySQL. Basically, the default configuration is tuned for performance, and is intended for running MySQL on a dedicated server, with lots of memory (multiple gigabytes).
Tune MySQL settings based on your requirements and available resources
When running MySQL in a container on a small DigitalOcean droplet (512 MB or 1 GB), you will have to modify the default settings to fit your situation, for example: limit the maximum number of simultaneous connections, use a smaller query cache, etc.
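As an illustration only (every value here is an assumption for a 512 MB-1 GB droplet, not a recommendation from the official images), a low-memory configuration fragment could look like this:
# create a config fragment; the /my/custom path is a placeholder
cat > /my/custom/low-memory.cnf <<'EOF'
[mysqld]
max_connections         = 30
key_buffer_size         = 16M
innodb_buffer_pool_size = 64M
query_cache_size        = 8M
tmp_table_size          = 16M
max_heap_table_size     = 16M
EOF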
Also note that, by default, DigitalOcean droplets don't have swap configured, which means that if they run out of memory they cannot use the SSD to swap. It's important to configure swap on those droplets so that MySQL doesn't crash if it temporarily needs more memory (e.g. when re-indexing the database).
This article describes how to configure swap on Ubuntu 14.04 on DigitalOcean: How To Add Swap on Ubuntu 14.04
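In short (a sketch assuming a 1 GB swap file is enough; the linked article covers sizing and details):
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots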
The following issues on the official MySQL Docker repository contain some hints for tuning MySQL settings for "performance" or "memory efficiency":
mysql immediately stops after running
Container with MySQL crash almost every day
The MySQL readme on Docker Hub describes how to use a custom configuration file; "Using a custom MySQL configuration file"
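For example (a sketch based on that readme; the /my/custom directory, container name, and password are placeholders), the fragment from the earlier sketch can be mounted into the container's config directory:
docker run --name some-mariadb -v /my/custom:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mariadb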

mysqldump increase apache process threads?

We are running into a problem where the number of Apache processes drastically increases at a specific time. On further investigation, we found that mysqldump was running on the MySQL server during that time.
We noticed that while mysqldump was running, the count of Apache instances (processes) shot up to the maximum. Since we limited MaxClients to 150, there was no further increase in processes.
My question is: is it possible that mysqldump would increase the number of Apache processes?
mysqldump itself does not cause Apache to do anything. mysqldump and the Apache server are only related in that, since they are running on the same machine, they share the resources of that machine.
I suspect that when you run mysqldump your server becomes resource-constrained (it could be CPU, network or disk). When Apache is resource-constrained, an individual Apache process cannot complete a request as quickly and move on to the next one, so new processes are spawned instead.
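One common mitigation (an addition of mine, not part of the original answer) is to run the dump at the lowest CPU and I/O priority and avoid table locks, so the web workload is starved less; the database name and output path are placeholders:
nice -n 19 ionice -c3 mysqldump --single-transaction --quick -u root -p mydb > /backups/mydb.sql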

Restart Mysql automatically when ubuntu on EC2 micro instance kills it when running out of memory

When the system runs out of memory, Ubuntu 12.04 kills the MySQL process:
Out of memory: Kill process 17074 (mysqld) score 146 or sacrifice child
So the process ends up killed.
This happens at peaks of server load, mainly because Apache runs wild and eats the remaining available memory. Possible approaches could be:
Somehow change the priority of MySQL so it's not killed (probably a bad fix, as something else will be killed instead).
Monitor the status of MySQL and restart it automatically whenever it's killed (the one I'm considering, but I don't know how to do it).
How do you see it?
Abrupt termination of a database server is a very serious crash. You need to avoid this in a production system, because it may not restart cleanly.
The database server is a shared resource, and should almost never terminate in an unplanned fashion in production. The only thing that should cause unplanned termination is a catastrophic hardware or power failure. Most properly configured production data base servers have an unplanned termination once every ten years or less frequently. Seriously.
What to do?
Fix your Apache configuration. Limit the number of worker threads and processes it can use, so it can't run wild (see the configuration sketch after this list). Learn how to do this; it's vital. See here: http://httpd.apache.org/docs/current/mod/mpm_common.html#maxrequestworkers
Fix the defects in your web app that are causing your apache to run wild.
If you can, move your mysqld server to a different server machine from apache, so the two don't contend for the same hardware resources.
Configure your mysqld to limit the number of connections it will accept from Apache worker threads or other clients (also shown in the sketch after this list). Your web app probably handles the situation where a worker thread needs to wait for a connection. See here: http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_max_connections
Are you on an EC2 micro instance? You need to do some serious tuning. See here: http://ubuntuforums.org/showthread.php?t=1979049
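As an illustration of the Apache worker limit and the mysqld connection limit mentioned above (all numbers are assumptions for a micro instance, not values from this answer), the relevant settings on Ubuntu 12.04 look roughly like this:
# /etc/apache2/apache2.conf - Apache 2.2 prefork (the directive is MaxRequestWorkers on Apache 2.4)
<IfModule mpm_prefork_module>
    StartServers           2
    MinSpareServers        2
    MaxSpareServers        4
    MaxClients            20
    MaxRequestsPerChild  500
</IfModule>
# /etc/mysql/my.cnf
[mysqld]
max_connections = 25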
You can check MySQL's status every minute (with cron) and restart it if it has crashed:
* * * * * service mysql status | grep running || service mysql restart
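To install it (a sketch; this assumes MySQL is managed by the service command on your system), add the line to root's crontab:
sudo crontab -e    # then paste the line above and save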