The QEMU documentation talks about socket-connected device backends (and other mechanisms), and I imagine this would be a way to handle memory-mapped I/O to the device in an external program.
But I can't find any good examples of how to make this happen. Does anyone have any pointers on how to do this with QEMU?
I have a host application which needs to access a QEMU guest physical address, for a RAM memory region.
I am not able to figure out the right way to do this. Any help is appreciated.
I am currently thinking of adding a socket server to the QEMU event loop to do this, but I am not able to find the documentation for it...
You can run QEMU with the gdb server enabled (-s), connect to the gdb server (TCP port 1234), and send it your commands for memory reads.
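For memory reads specifically, the easiest thing is to point gdb itself at the stub (target remote :1234) and use its x commands. Since the question is about driving this from a host application, here is a minimal sketch in C that speaks the GDB Remote Serial Protocol's 'm' (read memory) packet directly. It assumes QEMU was started with -s (stub listening on localhost:1234); the address and length are placeholders. Note that the 'm' packet goes through the current CPU's address translation, so it reads guest virtual addresses unless paging is off or identity-mapped.

```c
/* Minimal sketch: read guest memory through QEMU's gdbstub using the
 * GDB Remote Serial Protocol's 'm' packet.  Build: cc -o rsp_read rsp_read.c */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Wrap a payload in an RSP frame: $<payload>#<two-hex-digit checksum>. */
static int rsp_frame(char *out, size_t outsz, const char *payload)
{
    unsigned sum = 0;
    for (const char *p = payload; *p; ++p)
        sum = (sum + (unsigned char)*p) % 256;
    return snprintf(out, outsz, "$%s#%02x", payload, sum);
}

int main(void)
{
    unsigned long addr = 0x1000;   /* placeholder guest address */
    unsigned long len  = 16;       /* placeholder byte count    */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(1234) };
    inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
    if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
        perror("connect (is QEMU running with -s?)");
        return 1;
    }

    /* Build and send the read-memory request, e.g. "$m1000,10#xx". */
    char payload[64], pkt[128], reply[4096];
    snprintf(payload, sizeof payload, "m%lx,%lx", addr, len);
    int n = rsp_frame(pkt, sizeof pkt, payload);
    write(fd, pkt, n);

    /* The stub replies "+$<hex bytes>#<checksum>"; read until the
     * end-of-packet marker '#' arrives, then acknowledge with '+'. */
    size_t got = 0;
    while (got < sizeof reply - 1 && !memchr(reply, '#', got)) {
        ssize_t r = read(fd, reply + got, sizeof reply - 1 - got);
        if (r <= 0) break;
        got += r;
    }
    reply[got] = '\0';
    write(fd, "+", 1);
    close(fd);

    /* Strip the framing and print the returned hex bytes. */
    char *start = strchr(reply, '$'), *end = strchr(reply, '#');
    if (start && end && end > start) {
        *end = '\0';
        printf("memory bytes (hex): %s\n", start + 1);
    } else {
        fprintf(stderr, "unexpected reply: %s\n", reply);
    }
    return 0;
}
```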
We have developed some software and "encapsulated" it into a virtual machine that we run with VirtualBox from the command line, non-interactively (no graphical interface). We send some instructions to the virtual machine, and it outputs some resulting files. We have tested this locally on a Linux machine.

Now we would like to send this to many people using Linux, but we realize they will have different distributions, system library versions, etc., and then our VM might fail. So my question is: is it possible to have something like a static binary version of VirtualBox (or any other similar system / VM / container) that does not need to use the system libraries, so that it can be run like a static binary?
It would be important to know what the 'special' requirements of your solution are regarding system libraries and the like.
If you are using a standard host configuration, a standard VirtualBox install should be able to run the VM on any host OS.
Since a VM runs its own kernel, it is for the most part not dependent on host libraries. The exception to this is when accessing/controlling host resources (disk, network, etc.). That being said, VirtualBox provides ways to access the most common resources (disk, network, etc.) that are transparent to the VM, meaning that the VM will always be configured the same way regardless of whether the host is Windows, Linux, or Mac, and you can export your VM on Linux and import it on other platforms without having to tweak it.
A container (e.g. Docker) is more complicated, since it shares the kernel of the host, and so it depends on how the host kernel is configured.
Again, if your application doesn't depend on 'special' access to host resources, a Docker container will run the same way on all host OSes (Linux provides a native kernel, while Windows and Mac run a Linux virtual machine and then run the containers inside it).
If you feel this doesn't answer your question, please share more details about the 'special' needs/configurations of your application, so we can dive deeper into this.
Have you looked at providing portable instances of the VM that can run on different host systems?
This example shows how to create one for a Windows host (check it out here), but I'm sure it can be done for other host systems as well.
So I saw there is an option in Google Compute (I assume the same option exists with other cloud VM providers, so the question isn't specifically about Google Compute, but about the underlying technology) to resize the disk without having to restart the machine, and I ask: how is this possible?
Even if it uses some sort of abstraction to the disk and they don't actually assign a physical disk to the VM, but just part of a disk (or parts of a number of disks), once the disk is created in the guest VM it has a certain size. How can it change without needing a restart? Does it utilize NFS somehow?
This is built directly into disk protocols these days. This capability has existed for a while, since disks have been virtualized since the late 1990s (either through network protocols like iSCSI / FibreChannel, or through a software-emulated version of hardware like VMware).
Like the VMware model, GCE doesn't require any additional network hops or protocols to do this; the hypervisor just exposes the virtual disk as if it were a physical device, and the guest knows that its size can change and handles that. GCE uses a virtualization-specific driver type for its disks called virtio-scsi, but this feature is implemented in many other driver types (across many OSes) as well.
Since a disk can be resized at any time, disk protocols need a way to tell the guest that an update has occurred. In general terms, this works as follows in most protocols:
1. Administrator resizes disk from hypervisor UI (or whatever storage virtualization UI they're using).
2. Nothing happens inside the guest until it issues an IO to the disk.
3. Guest OS issues an IO command to the disk, via the device driver in the guest OS.
4. Hypervisor emulates that IO command, notices that the disk has been resized and the guest hasn't been alerted yet, and returns a response to the guest telling it to update its view of the device.
5. The guest OS recognizes this response and re-queries the device size and other details via some other command.
I'm not 100% sure, but I believe the reason it's structured like this is that traditionally disks cannot send updates to the OS unless the OS requests them first. This is probably because the disk has no way to know what memory is free to write to, and even if it did, no way to synchronize access to that memory with the OS. However, those constraints are becoming less true to enable ultra-high-throughput / ultra-low-latency SSDs and NVRAM, so new disk protocols such as NVMe may do this slightly differently (I don't know).
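To make the last step concrete: on a Linux guest with a SCSI-attached virtual disk (for example GCE's virtio-scsi persistent disks), you can also trigger that capacity re-query by hand through sysfs. Here is a rough sketch, assuming the disk shows up as /dev/sda (a placeholder name) and that it runs as root inside the guest after the resize:

```c
/* Sketch: force a Linux guest to re-read the capacity of a resized virtual
 * disk and print the new size.  Assumes a SCSI-attached disk (e.g. GCE's
 * virtio-scsi) visible as /dev/sda -- the device name is a placeholder.
 * Build: cc -o rescan rescan.c */
#include <stdio.h>

int main(void)
{
    const char *dev = "sda";          /* placeholder device name */
    char path[128];
    unsigned long long sectors = 0;

    /* Writing '1' to the SCSI device's 'rescan' attribute makes the guest
     * re-query the disk's capacity (step 5 in the list above). */
    snprintf(path, sizeof path, "/sys/class/block/%s/device/rescan", dev);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fputs("1\n", f);
    fclose(f);

    /* The size attribute reports capacity in 512-byte sectors. */
    snprintf(path, sizeof path, "/sys/class/block/%s/size", dev);
    f = fopen(path, "r");
    if (!f || fscanf(f, "%llu", &sectors) != 1) { perror(path); return 1; }
    fclose(f);

    printf("%s: %llu sectors (%llu GiB)\n", dev, sectors,
           sectors * 512ULL / (1024ULL * 1024 * 1024));
    return 0;
}
```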
I'm reading the HSA spec and it says that a user-mode application can submit its jobs into GPU queues directly, without any OS interaction. I think this must be because the application can talk with the GPU driver directly and therefore doesn't need to incur any OS kernel calls.
So my question is, as a very simple example: in a CUDA application, when we call cudaMalloc(), does it incur any OS kernel calls?
The entire premise of this question is flawed. "Submitting a job" and allocating memory are not the same thing. Even a user space process running on the host CPU which calls malloc will (most of the time) result in a kernel call as the standard library gathers or releases physical memory to its memory heap, normally via sbrk or mmap.
So yes, cudaMalloc results in an OS kernel call; if you run strace you will see the GPU driver invoking ioctl to issue commands to the GPU MMU/TLB. But so does running malloc in host code, and so, undoubtedly, does running malloc on a theoretical HSA platform as well.
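To see this for yourself, a tiny program like the sketch below is enough (the file name and allocation size are arbitrary): build it with nvcc and run it under strace, e.g. strace -f -e trace=openat,ioctl,mmap ./cudamalloc_trace, and you will see the CUDA runtime opening the NVIDIA device nodes (/dev/nvidiactl, /dev/nvidia0, /dev/nvidia-uvm) and issuing ioctl calls as cudaMalloc executes.

```c
/* Minimal sketch: observe the kernel calls behind cudaMalloc().
 * Build: nvcc -o cudamalloc_trace cudamalloc_trace.cu
 * Run:   strace -f -e trace=openat,ioctl,mmap ./cudamalloc_trace */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *dev_ptr = NULL;
    cudaError_t err = cudaMalloc(&dev_ptr, 1 << 20);  /* 1 MiB on the device */
    printf("cudaMalloc: %s, dev_ptr = %p\n", cudaGetErrorString(err), dev_ptr);
    cudaFree(dev_ptr);
    return 0;
}
```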
Scenario:
I have two machines, a client and a server, connected with Infiniband. The server machine has an NVIDIA Fermi GPU, but the client machine has no GPU. I have an application running on the GPU machine that uses the GPU for some calculations. The result data on the GPU is never used by the server machine, but is instead sent directly to the client machine without any processing. Right now I'm doing a cudaMemcpy to get the data from the GPU to the server's system memory, then sending it off to the client over a socket. I'm using SDP to enable RDMA for this communication.
Question:
Is it possible for me to take advantage of NVIDIA's GPUDirect technology to get rid of the cudaMemcpy call in this situation? I believe I have the GPUDirect drivers correctly installed, but I don't know how to initiate the data transfer without first copying it to the host.
My guess is that it isn't possible to use SDP in conjunction with GPUDirect, but is there some other way to initiate an RDMA data transfer from the server machine's GPU to the client machine?
Bonus: If someone has a simple way to test whether I have the GPUDirect dependencies correctly installed, that would be helpful as well!
Yes, it is possible with supporting networking hardware. See the GPUDirect RDMA documentation.
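To give a rough idea of what that looks like (a sketch only, not a definitive implementation): assuming a GPU and driver stack that support GPUDirect RDMA, an RDMA-capable NIC, and NVIDIA's peer-memory kernel module (nvidia-peermem, formerly nv_peer_mem) loaded, a device pointer returned by cudaMalloc can be registered directly with the verbs library, and RDMA operations posted against that memory region move data between the NIC and GPU memory without the intermediate cudaMemcpy. The sketch stops at the registration step (queue-pair setup and the actual transfers are omitted); if the registration succeeds, that is also a reasonable hint that the GPUDirect RDMA pieces are in place.

```c
/* Rough sketch only -- not a complete GPUDirect RDMA implementation.
 * Build (paths may differ on your system):
 *   cc gdr_check.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 \
 *      -lcudart -libverbs -o gdr_check */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    /* 1. Allocate the buffer in GPU memory (the data you currently
     *    cudaMemcpy to the host before sending). */
    void *gpu_buf = NULL;
    size_t len = 1 << 20;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    /* 2. Open the first RDMA device and allocate a protection domain. */
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;

    /* 3. Register the GPU buffer with the NIC.  If the GPUDirect RDMA pieces
     *    are in place this succeeds, and RDMA sends/writes posted with this
     *    memory region move data NIC<->GPU without touching host RAM.
     *    (Queue-pair creation, connection setup and the actual ibv_post_send
     *    calls are omitted from this sketch.) */
    struct ibv_mr *mr = pd ? ibv_reg_mr(pd, gpu_buf, len,
                                        IBV_ACCESS_LOCAL_WRITE |
                                        IBV_ACCESS_REMOTE_READ) : NULL;
    printf("ibv_reg_mr on GPU memory: %s\n", mr ? "OK" : "failed");

    if (mr) ibv_dereg_mr(mr);
    if (pd) ibv_dealloc_pd(pd);
    if (ctx) ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}
```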