Can nova support multiple hypervisors on a single controller? - openstack-nova

In nova.conf, we need to set compute_driver to either the KVM or the Xen driver. Does that mean one nova controller can support only one hypervisor technology? Can I mix KVM and Xen virtual machines in the same cloud?

You can mix hypervisors, but each compute node within a cell runs a single hypervisor type. You can then use image properties or scheduler hints to make sure VMs based on certain images land on the hypervisor type they support, or to schedule directly to a given hypervisor type, as in the sketch below.
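For example (a hedged sketch; img_hv_type is the image-metadata key in recent releases, older ones used hypervisor_type, and my-kvm-image / my-xen-image are placeholder names):

# Tag images so the scheduler's ImagePropertiesFilter lands them on matching hosts
openstack image set --property img_hv_type=kvm my-kvm-image
openstack image set --property img_hv_type=xen my-xen-image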

Related

What is the difference between containers and process VMs (NOT system VMs)?

As far as I understand, ...
virtualization, although commonly used to refer to server virtualization, refers to creating virtual versions of any IT component, such as networking and storage
although containerization is commonly contrasted with virtualization, it is technically a form of server virtualization that takes place at the OS level
although virtual machines (VMs) commonly refer to the output of hardware-level server virtualization (system VMs), they can also refer to the output of application virtualization (process VMs), such as the JVM
Bearing the above in mind, I am trying to wrap my head around the difference between containers and process VMs (NOT system VMs). In other words, what is the difference between OS-level server virtualization and application virtualization?
Don't both technically refer to one and the same thing: a platform-independent software execution environment that is created using software that abstracts the environment's underlying OS?
Although some say that the isolation achieved by containers is a key difference, it is also stated that a system VM "is limited to the resources and abstractions provided by the virtual machine".
I have created a graphic representation for you; it is easier (for me) to explain the differences this way. I hope it helps.
OS-level virtualization aims to run unmodified applications built for a particular OS. Applications can communicate with the external world only through the OS API, so a virtualization layer placed at that API can present a different image of the external world (e.g. amount of memory, network configuration, process list) to applications running in different virtualization contexts (containers). Generally the application runs on the "real" CPU (if not already virtualized) and does not need to know (and sometimes has no way of knowing) that the environment presented by the OS is somehow filtered. It is not a platform-independent software execution environment.
On the other hand, an application VM aims to run applications that are prepared specifically for that VM. For example, a Java VM interprets bytecode compiled for a "processor" that has little in common with a real CPU. There are CPUs that can run some Java bytecode natively, but the general concept is to provide a bytecode that is efficient for software interpretation on different "real" OS platforms. For this to work, the JVM has to provide some so-called native code to interface with the OS API it runs on. You can run your program on SPARC, ARM, Intel, etc., provided that you have the OS-specific interpreter application and your bytecode conforms to the specification.
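A minimal illustration of that portability (Hello.java is a placeholder; this assumes a JDK on the build machine and a JVM on each target):

# On any machine with a JDK:
javac Hello.java     # produces architecture-neutral Hello.class bytecode
# Copy Hello.class to a SPARC, ARM, or Intel machine and run it with
# that platform's own JVM -- no recompilation needed:
java Hello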

QEMU riscv32-softmmu, what does the softmmu mean?

Does the 'softmmu' mean that the virtual machine has a single linear address space available to machine and user mode? Or does it have some virtual memory capabilities that are implemented via software and not the underlying processor? Or maybe it means something different entirely?
-softmmu as a suffix in QEMU target names means "complete system emulation including an emulated MMU, for running entire guest OSes or bare metal programs". It is opposed to QEMU's -linux-user mode, which means "emulates a single Linux binary only, translating syscalls it makes into syscalls on the host". Building the foo-softmmu target will give you a qemu-system-foo executable; building foo-linux-user will give you a qemu-foo executable.
So a CPU emulated by -softmmu should provide all the facilities that the real guest CPU's hardware MMU provides, which usually means multiple address spaces which can be configured via the guest code setting up page tables and enabling the MMU.
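To make the naming concrete, this is roughly how the two flavours are built and invoked from a QEMU source tree (the kernel and binary names are placeholders):

./configure --target-list=riscv32-softmmu,riscv32-linux-user
make
# riscv32-softmmu builds a whole-system emulator, including the soft MMU:
qemu-system-riscv32 -machine virt -kernel my_os_or_baremetal.elf
# riscv32-linux-user builds a single-binary emulator:
qemu-riscv32 ./hello_riscv   # runs one Linux binary, translating its syscalls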

Type-1 Hypervisors non-volatile memory isolation

Hypervisors isolate different OSes running on the same physical machine from each other. Within this definition, separation of non-volatile memory (like hard drives or flash) exists as well.
When thinking about Type-2 hypervisors, it is easy to understand how they separate non-volatile memory, because they simply use the file-system implementation of the underlying OS to allocate different "hard-drive files" to each VM.
But then, when I come to think about Type-1 hypervisors, the problem becomes harder. They can use the IOMMU to isolate different hardware interfaces, but in the case of just one non-volatile memory interface in the system, I don't see how that helps.
So one way to implement it would be to split one device into two "partitions" and make the hypervisor interpret calls from the VMs, deciding whether each call is legitimate. I'm not familiar with the communication protocols of non-volatile storage interfaces, but the hypervisor would have to be familiar with those protocols in order to make the verdict, which sounds (maybe) like overkill.
Are there other ways to implement this kind of isolation?
Yes, you are right: the hypervisor will have to be familiar with those protocols in order to make the isolation possible.
The overhead depends mostly on the protocol. For example, NVMe-based SSDs work over PCIe, and some NVMe devices support SR-IOV, which greatly reduces the effort; others don't, leaving the burden on the hypervisor.
Mostly this support is configured at build time (how much storage is given to each guest, command privileges for each guest, etc.), and when a guest sends a command, the hypervisor verifies its bounds and forwards it accordingly; a sketch of such a check follows.
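As a rough sketch of that bounds check (hypothetical hypervisor code; the struct and names are invented for illustration, loosely mirroring the Starting LBA / Number of Logical Blocks fields of an NVMe I/O command):

/* Per-guest storage "partition" expressed as a window of logical block addresses. */
struct guest_extent {
    unsigned long long lba_base;   /* first LBA the guest may touch */
    unsigned long long lba_count;  /* number of LBAs assigned to it */
};

/* Return 1 if a guest read/write stays inside its window; the hypervisor
 * would then remap and forward the command, and reject it otherwise. */
int command_in_bounds(const struct guest_extent *g,
                      unsigned long long slba, unsigned long long nlb)
{
    if (slba < g->lba_base)
        return 0;
    if (nlb > g->lba_count || slba - g->lba_base > g->lba_count - nlb)
        return 0;  /* command would run past the guest's window */
    return 1;
}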
So why isn't there any hardware support, like the MMU or IOMMU, in this case?
There are hundreds of types of such devices with different protocols (NVMe, AHCI, etc.), and if a vendor tried to support all of them in hardware to allow better virtualization, it would end up with a huge chip that wouldn't fit.

NvLink or PCIe, how to specify the interconnect?

My cluster is equipped with both NVLink and PCIe. All the GPUs (V100) can communicate directly through either PCIe or NVLink. To my knowledge, both the PCIe switch and NVLink can support direct peer-to-peer links through CUDA.
Now, I want to compare the peer-to-peer communication performance of PCIe and NVLink. However, I don't know how to select one; it seems CUDA always chooses automatically. Could anyone help me?
If two GPUs in CUDA have a direct NVLink connection between them, and you enable Peer-to-Peer transfers, those transfers will flow over NVLink. There is no method of any kind in CUDA to alter this behavior.
If you do not enable Peer-to-Peer transfers, then data transfers (e.g. cudaMemcpy, cudaMemcpyAsync, cudaMemcpyPeerAsync) between those two devices will flow from the source GPU over PCIe to the CPU socket (perhaps traversing intermediate PCIe switches, perhaps also flowing over a socket-level link such as QPI), and then over PCIe from the CPU socket to the other GPU. At least one CPU socket will always be involved, even if a shorter direct path exists across the PCIe fabric. This behavior is also not modifiable in any fashion available to the programmer.
Both methodologies are demonstrated using the p2pBandwidthLatencyTest CUDA sample code.
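For reference, the enablement step looks like this in a minimal sketch (the same CUDA runtime calls the sample uses; error handling omitted):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  /* can device 0 address device 1? */
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   /* P2P on: copies ride NVLink if present */
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }
    size_t bytes = 1 << 20;
    void *src = NULL, *dst = NULL;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);
    /* With peer access enabled this goes GPU-to-GPU; without it, the copy
     * is staged through the CPU socket over PCIe, as described above. */
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    printf("peer copy done (P2P %s)\n", (can01 && can10) ? "on" : "off");
    return 0;
}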
The accepted answer -- from an NVIDIA employee -- was correct in 2018. But at some point, NVIDIA added an (undocumented?) option to the driver.
On Linux, you can now put this in /etc/modprobe.d/disable-nvlink.conf:
options nvidia NVreg_NvLinkDisable=1
This will disable NVLink when the driver is next loaded, forcing GPU peer-to-peer communication to use the PCIe interconnect. This gadget exists in driver 515.65.01 (CUDA 11.7.1). I am not sure when it was added.
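To confirm the link state after reloading the driver, nvidia-smi has an nvlink subcommand (exact output varies by driver version):

nvidia-smi nvlink --status   # with the option above, links should report inactive (or print nothing)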
As for "there is no reason to allow the end-user to choose the slower path", the very existence of this SO question suggests otherwise. In my case, we buy not one server, but dozens... And in the process of choosing our configuration, it is nice to use a single prototype system to benchmark our application using either NVLink or PCIe.

KVM as hypervisor choice in GCE

As per Wikipedia, Google Compute Engine uses KVM as its hypervisor. I can see a mention of vCPUs while creating an instance.
Why KVM? Why not VMware or Xen?
I mean, what is the specific reason for choosing KVM as the hypervisor?
PS:
Xen is also an open-source product.
There were a number of factors in the decision, you might not be surprised to learn. :-)
One important factor was compatibility between KVM and existing isolation/scaling processes at Google (cgroups, aka "containers"). This allows Google to reuse the same mechanisms it uses to ensure performance of applications like web search and Gmail to provide consistent performance between VMs scheduled on the same machine. This helps GCE avoid noisy-neighbor problems.
As you're probably aware, Google has had a long history of Linux kernel development; using KVM allows Google to leverage that talent for GCE. In addition, the hypervisor/hardware-emulation split in KVM (where the hypervisor implemented by KVM only emulates a few low-level devices/features and defers the remaining emulation to the process that opens /dev/kvm) allows for development of virtual devices that have access to the full range of user-space software, including infrastructure like Colossus and BigTable where needed.
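To illustrate that split, here is a minimal sketch of the user-space side of the KVM API (the standard ioctl sequence on Linux; error handling omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <stdio.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);        /* the in-kernel hypervisor */
    printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

    int vm = ioctl(kvm, KVM_CREATE_VM, 0);     /* a bare VM: no devices at all */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);  /* one virtual CPU */

    /* Everything else -- disks, NICs, firmware -- is emulated by this
     * user-space process, which is where a cloud provider can plug in
     * device models backed by its own infrastructure. */
    (void)vcpu;
    return 0;
}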
Xen, VMware, and Hyper-V are also great hypervisors and machine emulators, but hopefully that gives you a glimpse into some of the reasons that KVM was a good fit for Google.