Google Compute instance won't mount persistent disk, maintains ~100% CPU - google-compute-engine

During some routine use of my web server (saving posts via WordPress), my instance suddenly jumped up to 400% CPU usage and wouldn't come back down below 100%. Restarting and stopping/starting the instance didn't change anything.
Looking at the last bit of my serial output:
[ 0.678602] md: Waiting for all devices to be available before autodetect
[ 0.679518] md: If you don't use raid, use raid=noautodetect
[ 0.680548] md: Autodetecting RAID arrays.
[ 0.681284] md: Scanned 0 and added 0 devices.
[ 0.682173] md: autorun ...
[ 0.682765] md: ... autorun DONE.
[ 0.683716] VFS: Cannot open root device "sda1" or unknown-block(0,0): error -6
[ 0.685298] Please append a correct "root=" boot option; here are the available partitions:
[ 0.686676] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 0.688489] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-30-generic #34~14.04.1-Ubuntu
[ 0.689287] Hardware name: Google Google, BIOS Google 01/01/2011
[ 0.689287] ffffea00008ae400 ffff880024ee7db8 ffffffff817af477 000000000000111e
[ 0.689287] ffffffff81a7c6c0 ffff880024ee7e38 ffffffff817a9338 ffff880024ee7dd8
[ 0.689287] ffffffff00000010 ffff880024ee7e48 ffff880024ee7de8 ffff880024ee7e38
[ 0.689287] Call Trace:
[ 0.689287] [<ffffffff817af477>] dump_stack+0x45/0x57
[ 0.689287] [<ffffffff817a9338>] panic+0xc1/0x1f5
[ 0.689287] [<ffffffff81d3e5f3>] mount_block_root+0x210/0x2a9
[ 0.689287] [<ffffffff81d3e822>] mount_root+0x54/0x58
[ 0.689287] [<ffffffff81d3e993>] prepare_namespace+0x16d/0x1a6
[ 0.689287] [<ffffffff81d3e304>] kernel_init_freeable+0x1f6/0x20b
[ 0.689287] [<ffffffff81d3d9a7>] ? initcall_blacklist+0xc0/0xc0
[ 0.689287] [<ffffffff8179fab0>] ? rest_init+0x80/0x80
[ 0.689287] [<ffffffff8179fabe>] kernel_init+0xe/0xf0
[ 0.689287] [<ffffffff817b6d98>] ret_from_fork+0x58/0x90
[ 0.689287] [<ffffffff8179fab0>] ? rest_init+0x80/0x80
[ 0.689287] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 0.689287] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
(Not sure if it's obvious from that, but I'm using the standard Ubuntu 14.04 image)
I've tried taking snapshots and mounting them on new instances, and now I've even deleted the instance and mounted the disk on to a new one, still the same issue and exactly the same serial output.
I really hope my data has not been hopelessly corrupted. Not sure if anyone has any suggestions on recovering data from a persistent disk?
Note that the accepted answer for: Google Compute Engine VM instance: VFS: Unable to mount root fs on unknown-block did not work for me.

I posted this on another question, but this question is worded better, so I'll re-post it here.
What Causes This?
That is the million dollar question. After inspecting my GCE VM, I found out there were 14 different kernels installed, taking up several hundred MB of space. Most of the kernels didn't have a corresponding initrd.img file and were therefore not bootable (including 3.19.0-39-generic).
I certainly never went around trying to install random kernels, and once removed, they no longer appear as available upgrades, so I'm not sure what happened. Seriously, what happened?
Edit: New response from Google Cloud Support.
I received another disconcerting response. This may explain the additional, errant kernels.
"On rare occasions, a VM needs to be migrated from one physical host to another. In such case, a kernel upgrade and security patches might be applied by Google."
How to recover your instance...
After several back-and-forth emails, I finally received a response from support that allowed me to resolve the issue. Be mindful, you will have to change things to match your unique VM.
1. Take a snapshot of the disk first, in case you need to roll back any of the changes below.
2. Edit the properties of the broken instance to disable the option "Delete boot disk when instance is deleted".
3. Delete the broken instance. IMPORTANT: make sure you do not select the option to delete the boot disk, otherwise the disk will be removed permanently!
4. Start up a new temporary instance.
5. Attach the broken disk to the temporary instance (it will appear as /dev/sdb1). These steps can also be scripted; see the sketch below.
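A rough sketch of the same steps using the gcloud CLI, where the instance/disk names and the zone are placeholders for your own:
# Snapshot the broken boot disk first, for rollback (names and zone are examples)
$ gcloud compute disks snapshot broken-disk --snapshot-names broken-disk-backup --zone us-central1-f
# Delete the broken instance but keep all of its disks
$ gcloud compute instances delete broken-instance --keep-disks all --zone us-central1-f
# Create a temporary rescue instance and attach the broken disk to it
$ gcloud compute instances create temp-instance --zone us-central1-f
$ gcloud compute instances attach-disk temp-instance --disk broken-disk --zone us-central1-f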
When the temporary instance has booted up, run the following commands inside it:
# Run fsck to fix any disk corruption issues
$ sudo fsck.ext4 -a /dev/sdb1
# Mount the disk from the broken vm
$ sudo mkdir /mnt/sdb
$ sudo mount /dev/sdb1 /mnt/sdb/ -t ext4
# Find out the UUID of the broken disk. In this case, the uuid of sdb1 is d9cae47b-328f-482a-a202-d0ba41926661
$ ls -alt /dev/disk/by-uuid/
lrwxrwxrwx. 1 root root 10 Jan 6 07:43 d9cae47b-328f-482a-a202-d0ba41926661 -> ../../sdb1
lrwxrwxrwx. 1 root root 10 Jan 6 05:39 a8cf6ab7-92fb-42c6-b95f-d437f94aaf98 -> ../../sda1
# Update the UUID in grub.cfg (if necessary)
$ sudo vim /mnt/sdb/boot/grub/grub.cfg
Note: This ^^^ is where I deviated from the support instructions.
Instead of modifying all the boot entries to set root=UUID=[uuid character string], I looked for all the entries that set root=/dev/sda1 and deleted them. I also deleted every entry that didn't set an initrd.img file. The top boot entry with correct parameters in my case ended up being 3.19.0-31-generic. But yours may be different.
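If you want to check which entries are affected before deleting anything, a couple of greps against the mounted disk can help (paths as mounted above):
# Show each boot entry's title, root= parameter, and initrd line
$ grep -n -e 'menuentry ' -e 'root=' -e 'initrd' /mnt/sdb/boot/grub/grub.cfg
# List which kernel/initrd pairs actually exist on the disk
$ ls /mnt/sdb/boot/ | grep -e vmlinuz -e initrd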
# Flush all changes to disk
$ sudo sync
# Shut down the temporary instance
$ sudo shutdown -h now
Finally, detach the HDD from the temporary instance, and create a new instance based off of the fixed disk. It will hopefully boot.
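This last step can also be done from the command line; a sketch with the same placeholder names as above:
# Detach the repaired disk and boot a new instance from it
$ gcloud compute instances detach-disk temp-instance --disk broken-disk --zone us-central1-f
$ gcloud compute instances create recovered-instance --disk name=broken-disk,boot=yes --zone us-central1-f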
Assuming it does boot, you have a lot of work to do. If you have half as many unused kernels as I did, you might want to purge them (especially since some are likely missing a corresponding initrd.img file).
I used the second answer (the terminal-based one) in this askubuntu question to purge the other kernels.
Note: Make sure you don't purge the kernel you booted with!
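For reference, a minimal purge sketch on Ubuntu 14.04 (the version string below is an example; substitute one of your unused kernels, never the one reported by uname -r):
# Identify the running kernel -- do not purge this one
$ uname -r
# List all installed kernel images
$ dpkg -l 'linux-image-*'
# Purge one unused kernel and regenerate the grub config
$ sudo apt-get purge linux-image-3.19.0-30-generic
$ sudo update-grub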

In order to recover your data, you need to create a brand new instance that you can SSH into, and attach the corrupted disk to it as a secondary disk. More information can be found in this article. I would suggest taking a snapshot of the corrupted disk before attaching it, for backup purposes.
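As a sketch, once the corrupted disk is attached (it will typically appear as /dev/sdb), you can mount it read-only and copy the data off; the device name and paths here are assumptions:
# Mount the secondary disk read-only so nothing further is written to it
$ sudo mkdir -p /mnt/recovery
$ sudo mount -o ro /dev/sdb1 /mnt/recovery
# Copy off whatever you need, e.g. a WordPress docroot
$ rsync -av /mnt/recovery/var/www/ ~/recovered-www/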

Related

How to run several IPFS nodes on a single machine?

For testing, I want to be able to run several IPFS nodes on a single machine.
This is the scenario:
I am building small services on top of the IPFS core library, following the Making your own IPFS service guide. When I try to put the client and the server on the same machine (note that each of them will create their own IPFS node), I get the following:
panic: cannot acquire lock: Lock FcntlFlock of /Users/long/.ipfs/repo.lock failed: resource temporarily unavailable
Usually, when you start with IPFS, you will use ipfs init, which will create a new node. The default data and config for that particular node are stored at ~/.ipfs. Here is how you can create a new node and configure it so it can run beside your default node.
1. Create a new node
For a new node, you have to run ipfs init again, for instance like this:
IPFS_PATH=~/.ipfs2 ipfs init
This will create a new node at ~/.ipfs2 (not using the default path).
2. Change Address Configs
As both of your nodes now bind to the same ports, you need to change the port configuration so both nodes can run side by side. For this, open ~/.ipfs2/config and find Addresses:
"Addresses": {
"API": "/ip4/127.0.0.1/tcp/5001",
"Gateway": "/ip4/127.0.0.1/tcp/8080",
"Swarm": [
"/ip4/0.0.0.0/tcp/4001",
"/ip6/::/tcp/4001"
]
}
To for example the following:
"Addresses": {
"API": "/ip4/127.0.0.1/tcp/5002",
"Gateway": "/ip4/127.0.0.1/tcp/8081",
"Swarm": [
"/ip4/0.0.0.0/tcp/4002",
"/ip6/::/tcp/4002"
]
}
With this, you should be able to run both nodes, .ipfs and .ipfs2, on a single machine.
Notes:
- Whenever you use .ipfs2, you need to set the environment variable IPFS_PATH=~/.ipfs2.
- In your example, you need to change either your client or your server node from ~/.ipfs to ~/.ipfs2.
- You can also start the daemon on the second node using IPFS_PATH=~/.ipfs2 ipfs daemon & (see the full command-line sketch below).
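The same setup can be done entirely with the ipfs config command instead of editing the JSON by hand; a sketch, assuming the second repo lives at ~/.ipfs2:
# Create the second repo
IPFS_PATH=~/.ipfs2 ipfs init
# Move the API, gateway, and swarm addresses to non-conflicting ports
IPFS_PATH=~/.ipfs2 ipfs config Addresses.API /ip4/127.0.0.1/tcp/5002
IPFS_PATH=~/.ipfs2 ipfs config Addresses.Gateway /ip4/127.0.0.1/tcp/8081
IPFS_PATH=~/.ipfs2 ipfs config --json Addresses.Swarm '["/ip4/0.0.0.0/tcp/4002", "/ip6/::/tcp/4002"]'
# Start the second daemon
IPFS_PATH=~/.ipfs2 ipfs daemon &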
Hello, I used .ipfs2. After running the two daemons at the same time, I can indeed open localhost:5001/webui, but opening the second one, localhost:5002/webui, gives an error, as shown in the attachment.
Here are some ways I've used to create multiple nodes/peer IDs.
I use Windows 10.
- 1st node: go-ipfs (latest version)
- 2nd node: Siderus Orion ipfs (connects to the Orion node, not a local one) -- https://orion.siderus.io/
- Use VirtualBox to run a minimal Ubuntu installation (you can set up as many as you want). Repeat the process and you have 4 nodes, or as many as you want.
https://discuss.ipfs.io/t/ipfs-manager-download-install-manage-debug-your-ipfs-node/3534 is another GUI that installs and lets you manage all IPFS commands without the command line. It was just released a few days ago and looks well worth a review.
Disclaimer: I am not a coder or computer professional, just a huge fan of IPFS! I hope we can raise awareness and change the world.

Increasing the size of the queue in Elasticsearch?

I've been looking at my Elasticsearch logs, and I came across this error:
rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23#6d32fa18
After looking up the error, the general consensus was to increase the size of the queue, as discussed here - https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
The question I have is: how do I actually do this? Is there a configuration file somewhere that I am missing?
From Elasticsearch 5 onward, you cannot use the API to update the thread pool search queue size. It is now a node-level setting. See this.
Thread pool settings are now node-level settings. As such, it is not possible to update thread pool settings via the cluster settings API.
To update the thread pool, you have to add thread_pool.search.queue_size: <new size> to the elasticsearch.yml file of each node and then restart Elasticsearch.
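A minimal sketch, assuming a package install with the config at /etc/elasticsearch/elasticsearch.yml (2000 is an arbitrary example size):
# On each node: append the setting, then restart the service
echo 'thread_pool.search.queue_size: 2000' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
sudo systemctl restart elasticsearch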
To change the queue size, one could add it to the config file on each of the nodes as follows:
threadpool.search.queue_size: <new queue size>
However, this would also require a cluster restart.
Up to Elasticsearch 2.x, you could update this via the cluster settings API without a cluster restart, but that option is gone with Elasticsearch 5.x and newer.
curl -XPUT _cluster/settings -d '{
    "persistent" : {
        "threadpool.search.queue_size" : <new_size>
    }
}'
You can query the queue size as follows:
curl "<server>/_cat/thread_pool?v&h=search.queueSize"
As of Elasticsearch 6, the type of the search thread pool has changed to fixed_auto_queue_size. This means that setting thread_pool.search.queue_size in elasticsearch.yml is not enough; you have to control the min_queue_size and max_queue_size parameters as well, like this:
thread_pool.search.queue_size: <new_size>
thread_pool.search.min_queue_size: <new_size>
thread_pool.search.max_queue_size: <new_size>
I recommend using _cluster/settings?include_defaults=true to view the current thread pool settings in your nodes before making any changes. For more information about the fixed_auto_queue_size thread pool, read the docs.
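For example, to view just the search thread pool defaults (localhost:9200 is an assumed endpoint; filter_path merely trims the response):
curl -s "localhost:9200/_cluster/settings?include_defaults=true&filter_path=defaults.thread_pool.search"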

Unable to access Google Compute Engine instance using external IP address

I have a Google Compute Engine instance (CentOS) which I could access using its external IP address until recently.
Now the instance suddenly cannot be accessed using its external IP address.
I logged in to the developer console and tried rebooting the instance but that did not help.
I also noticed that the CPU usage is almost at 100% continuously.
On further analysis of the Serial port output it appears the init module is not loading properly.
I am pasting below the last few lines from the serial port output of the virtual machine.
rtc_cmos 00:01: RTC can wake from S4
rtc_cmos 00:01: rtc core: registered rtc_cmos as rtc0
rtc0: alarms up to one day, 114 bytes nvram
cpuidle: using governor ladder
cpuidle: using governor menu
EFI Variables Facility v0.08 2004-May-17
usbcore: registered new interface driver hiddev
usbcore: registered new interface driver usbhid
usbhid: v2.6:USB HID core driver
GRE over IPv4 demultiplexor driver
TCP cubic registered
Initializing XFRM netlink socket
NET: Registered protocol family 17
registered taskstats version 1
rtc_cmos 00:01: setting system clock to 2014-07-04 07:40:53 UTC (1404459653)
Initalizing network drop monitor service
Freeing unused kernel memory: 1280k freed
Write protecting the kernel read-only data: 10240k
Freeing unused kernel memory: 800k freed
Freeing unused kernel memory: 1584k freed
Failed to execute /init
Kernel panic - not syncing: No init found. Try passing init= option to kernel.
Pid: 1, comm: swapper Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Call Trace:
[] ? panic+0xa7/0x16f
[] ? init_post+0xa8/0x100
[] ? kernel_init+0x2e6/0x2f7
[] ? child_rip+0xa/0x20
[] ? kernel_init+0x0/0x2f7
[] ? child_rip+0x0/0x20
Thanks in advance for any tips to resolve this issue.
Mathew
It looks like you might have a script or other program that is causing you to run out of inodes.
You can delete the instance without deleting the persistent disk (PD) and create a new VM with a higher capacity using your PD. However, if a script is causing this, you will end up with the same issue. It's always recommended to back up your PD before making any changes.
Run this command to find more info about your instance:
gcutil --project=<project-id> getserialportoutput <instance-name>
If the issue still continues, you can either:
- make a snapshot of your PD and work from a copy, or
- delete the instance without deleting the PD.
Then attach and mount the PD to another VM as a second disk, so you can access it to find out what is causing this issue. Visit this link https://developers.google.com/compute/docs/disks#attach_disk for more information on how to do this.
Visit this page http://www.ivankuznetsov.com/2010/02/no-space-left-on-device-running-out-of-inodes.html for more information about inode troubleshooting.
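To confirm an inode shortage and locate the directory responsible, something like the following helps (standard tools, run on the affected filesystem):
# IUse% at 100% means no new files can be created even if df -h shows free space
df -i
# Count files per directory to find the culprit (often session, cache, or mail spool dirs)
sudo find / -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -20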
Make sure the "Allow HTTP traffic" setting on the VM is still enabled.
Then check which network firewall you are using and its rules.
If your network is set up to use an ephemeral IP, it will periodically be released back. This will cause your IP to change over time. Set it to static/reserved instead (on the Networks page).
https://developers.google.com/compute/docs/instances-and-network#externaladdresses

Google Compute Engine mounting persistent disk issues

I am following this guide https://developers.google.com/compute/docs/troubleshooting#ssherrors specifically the section about recovering your persistent disk with another vm.
I am trying to follow this part:
mount /dev/disk/by-id/scsi-0Google_PersistentDisk_myinstance-debugging /mnt/myinstance
This is the error I get:
root@debugger:~# mount /dev/disk/by-id/scsi-0Google_PersistentDisk_marty-wll-debugging /mnt/marty-wll
mount: you must specify the filesystem type
I am unsure of the filesystem type, since these are Google Compute disks, and the original system has already been deleted; the disk is now attached to another machine, following the Google developers guide I referenced above.
root@debugger:/dev/disk/by-id# parted scsi-0Google_PersistentDisk_marty-wll-debugging -l
Model: Google PersistentDisk (scsi)
Disk /dev/sda: 10.7GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 10.7GB 10.7GB primary ext4
Model: Google PersistentDisk (scsi)
Disk /dev/sdb: 10.7GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 10.7GB 10.7GB primary ext4
gave me the information that it's "ext4".
However, when I issue the following command, I still get an error:
root@debugger:~# mount -t ext4 /dev/disk/by-id/scsi-0Google_PersistentDisk_marty-wll-debugging /mnt/marty-wll
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
dmesg/syslog said:
[ 2452.205447] EXT4-fs (sdb): VFS: Can't find ext4 filesystem
any ideas?
Thanks for pointing this out, I will update the docs. Try adding -part1 to the end of your device name. This will mount the partition, instead of the disk. For your specific case:
mount /dev/disk/by-id/scsi-0Google_PersistentDisk_myinstance-debugging-part1 /mnt/myinstance
Also, there are cleaner aliases, so this should work as well:
mount /dev/disk/by-id/google-myinstance-debugging-part1 /mnt/myinstance
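If you're unsure which -part suffixes exist for a given disk, listing the by-id directory shows the available aliases and their partitions:
ls -l /dev/disk/by-id/ | grep Google_PersistentDisk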

Starting instance again after power off

How do I start an instance on GCE again after powering it off?
The instance shows TERMINATED, but has a PERSISTENT disk type.
If I use "add instance" with the same instance name, it asks me to select a new image, with only a choice of OS-level images, not my existing disk.
It then fails with:
ERROR: RESOURCE_ALREADY_EXISTS: The resource XXXX already exists
Is there a way to start (or clone) a copy of the image once stopped?
Anything similar to AWS stop/start? I don't care about the instance state or scratch disks being saved; I just want to start it again, since I have the boot disk stored and paid for.
Success! Below is the stop/start procedure, assuming that $PROJECT and $INSTANCE are set appropriately:
#--------- stop instance -----
#connect and shutdown
gcutil --project=$PROJECT ssh $INSTANCE
sudo shutdown -h now
# check
gcutil listinstances --project $PROJECT
#delete instance/keep boot disk , use -f to avoid confirmation
gcutil --project=$PROJECT deleteinstance $INSTANCE --nodelete_boot_pd
# check disks
gcutil listdisks --project=$PROJECT
#--------- start new instance -----
# launch instance using the existing disk (has to be in the same zone!)
gcutil --project=$PROJECT addinstance $INSTANCE --disk=$DISK,boot --zone=$ZONE --machine_type=n1-standard-1
#check that it's running
gcutil listinstances --project $PROJECT
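As a side note, the gcloud CLI that later replaced gcutil supports stop/start directly, which is much closer to the AWS behavior:
# Stop the instance (persistent disks are kept) and start it again later
gcloud compute instances stop $INSTANCE --project $PROJECT --zone $ZONE
gcloud compute instances start $INSTANCE --project $PROJECT --zone $ZONE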
You're on the right track. You just need to delete the existing TERMINATED instance before adding it again.
Even though the instance isn't running when it is TERMINATED, the resources (such as Persistent Disk) are still allocated to it.
Also, if this instance was created before December 5th (when Compute Engine went GA), you'll need to add a kernel to the disk or it won't boot. See the transition guide for details.
(For a temporary work around to upgrading the kernel, see this Q/A: My Google Compute Engine instances hang during boot using the v1 API)