How does QEMU do device emulation for memory-mapped devices when using a hardware CPU (KVM)?

How does QEMU intercept only those addresses in the guest address space that belong to memory-mapped devices?
Can someone please explain the full path of, let's say, a read? How does a read from an address X get intercepted (and directed to a device back-end), and how is the read then completed?
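For background on the read path (a rough outline, with the device name and register layout below invented purely for illustration): the guest's load from an MMIO address causes a VM exit, KVM returns to QEMU with a KVM_EXIT_MMIO exit, QEMU looks up the MemoryRegion covering that address, calls the .read callback the device registered via memory_region_init_io(), copies the returned value into the exit structure, and re-enters the guest so the load completes with that data. A minimal sketch of the device side:

/* Minimal sketch of an MMIO device model (names are illustrative, not a real QEMU device). */
#include "qemu/osdep.h"
#include "hw/sysbus.h"

typedef struct MyDevState {
    SysBusDevice parent_obj;
    MemoryRegion mmio;
    uint32_t reg0;                      /* single 32-bit register for the example */
} MyDevState;

static uint64_t mydev_read(void *opaque, hwaddr addr, unsigned size)
{
    MyDevState *s = opaque;
    /* 'addr' is the offset within the region; the value returned here is what
     * QEMU hands back to KVM, and therefore what the guest's load observes. */
    return addr == 0 ? s->reg0 : 0;
}

static void mydev_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    MyDevState *s = opaque;
    if (addr == 0) {
        s->reg0 = val;
    }
}

static const MemoryRegionOps mydev_ops = {
    .read = mydev_read,
    .write = mydev_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};

static void mydev_realize(DeviceState *dev, Error **errp)
{
    MyDevState *s = (MyDevState *)dev;
    /* Registering this I/O region (rather than RAM) is what makes guest
     * accesses to it trap out of KVM and into QEMU's dispatch code. */
    memory_region_init_io(&s->mmio, OBJECT(dev), &mydev_ops, s, "mydev", 0x1000);
    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &s->mmio);
}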

Related

Does cuMemcpy "care" about the current context?

Suppose I have a GPU and driver version supporting unified addressing; two GPUs, G0 and G1; a buffer allocated in G1 device memory; and that the current context C0 is a context for G0.
Under these circumstances, is it legitimate to cuMemcpy() from my buffer to host memory, despite it having been allocated in a different context for a different device?
So far, I've been working under the assumption that the answer is "yes". But I've recently experienced some behavior which seems to contradict this assumption.
Calling cuMemcpy from another context is legal, regardless of which device the context was created on. Depending on which case you are in, I recommend the following:
- If this is a multi-threaded application, double-check your program and make sure you are not releasing your device memory before the copy is completed
- If you are using the cuMallocAsync/cuFreeAsync API to allocate and/or release memory, please make sure that operations are correctly stream-ordered
- Run compute-sanitizer on your program
If you keep experiencing issues after these steps, you can file a bug with NVIDIA here.
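As a hedged illustration of the cross-context point above, here is a minimal driver-API sketch (assuming a machine with at least two CUDA devices; error handling is reduced to a macro): the buffer is allocated while a context for G1 is current, and the copy to host memory is issued while the context for G0 is current.

/* Sketch: cuMemcpy() on a buffer allocated in another device's context.
 * Assumes >= 2 CUDA devices and unified addressing; link with -lcuda. */
#include <cuda.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { CUresult r_ = (call); if (r_ != CUDA_SUCCESS) { \
    fprintf(stderr, "%s failed: %d\n", #call, (int)r_); exit(1); } } while (0)

int main(void)
{
    CUdevice g0, g1;
    CUcontext c0, c1;
    CUdeviceptr buf;
    unsigned char host[256];

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&g0, 0));
    CHECK(cuDeviceGet(&g1, 1));
    CHECK(cuCtxCreate(&c0, 0, g0));          /* context C0 for device G0 */
    CHECK(cuCtxCreate(&c1, 0, g1));          /* context C1 for device G1 */

    /* Allocate and fill the buffer while C1 (device G1) is current. */
    CHECK(cuCtxSetCurrent(c1));
    CHECK(cuMemAlloc(&buf, sizeof(host)));
    CHECK(cuMemsetD8(buf, 0xAB, sizeof(host)));
    CHECK(cuCtxSynchronize());

    /* Make C0 (device G0) current and copy the G1 buffer to host memory.
     * With unified addressing the driver resolves where 'buf' lives,
     * regardless of which context is current. */
    CHECK(cuCtxSetCurrent(c0));
    CHECK(cuMemcpy((CUdeviceptr)(uintptr_t)host, buf, sizeof(host)));

    printf("first byte: 0x%02x\n", host[0]);

    CHECK(cuCtxSetCurrent(c1));
    CHECK(cuMemFree(buf));
    CHECK(cuCtxDestroy(c1));
    CHECK(cuCtxDestroy(c0));
    return 0;
}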

How to access a PCI device from another device

I'm creating a new PCI device in QEMU that is part DMA engine and part NVMe controller.
I need to get the physical address of the NVMe device from within my new device in order to use
dma_memory_read(...)
Is there a function to get the other device's address?
Is there another function I can use that does not need a physical address?
Is there another way to do it, through pointers?
Generally the best question to ask when trying to figure out how to model devices in QEMU is "what does the real hardware do?".
For real PCI devices, the only way they can access other devices elsewhere in the system is if they do DMA accesses, which they do using PCI addresses (which are usually about the same thing as physical addresses on x86, but not necessarily so on other architectures). In QEMU we model this by having APIs for PCI devices to do DMA accesses (pci_dma_*()).
On the other hand, if you have a PCI card that is itself implementing an NVMe controller (or another kind of controller, like a SCSI disk controller), the answer is that the disks are plugged directly into the controller, which then can talk to them with no physical addresses involved at all. In QEMU we model this by having a concept of controller devices possibly having a "bus" which the disks are plugged into.
How does the real hardware communicate between the PCI device and the NVMe controller? Generally the answer is not "weird backdoor mechanism", so you shouldn't be looking for an API in QEMU that corresponds to "weird backdoor mechanism".
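As a small, hedged sketch of the pci_dma_*() point above (the device struct and descriptor layout are made up for illustration): a PCI device model that needs data from guest memory goes through the DMA helpers, which translate PCI/DMA addresses through the device's DMA address space rather than dealing in raw host pointers.

/* Sketch: a PCI device model fetching a descriptor from guest memory via DMA.
 * 'MyPCIDev' and 'MyDescriptor' are illustrative, not a real QEMU device. */
#include "qemu/osdep.h"
#include "hw/pci/pci.h"

typedef struct {
    uint64_t buf_addr;   /* PCI/DMA address the guest driver wrote into a register */
    uint32_t len;
    uint32_t flags;
} MyDescriptor;

typedef struct MyPCIDev {
    PCIDevice parent_obj;
    uint64_t desc_base;  /* programmed by the guest through an MMIO register */
} MyPCIDev;

static void mydev_fetch_descriptor(MyPCIDev *s, MyDescriptor *desc)
{
    /* pci_dma_read() goes through the device's DMA address space (including
     * any IOMMU in front of it) and copies the bytes into 'desc'. */
    pci_dma_read(&s->parent_obj, s->desc_base, desc, sizeof(*desc));
}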

CUDA malloc, mmap/mremap

CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the PCIe bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, and then free the old allocation. If you expect to do this a lot and don't mind the significant overhead associated with it, you could take a look at Thrust device vectors, which hide some of the manual labor and allow you to resize an allocation with a single vector-style .resize() operation. There's no magic, however: Thrust is just a template library built on top of CUDA C (for the CUDA device backend), so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
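A hedged sketch of that manual pattern, with a made-up helper name: allocate the larger buffer, copy the old contents device-to-device, then free the old buffer.

/* Sketch: manually "growing" a device allocation, since there is no
 * mremap-style call for cudaMalloc'ed memory. Helper name is illustrative. */
#include <cuda_runtime.h>

static cudaError_t grow_device_buffer(void **d_buf, size_t old_bytes, size_t new_bytes)
{
    void *d_new = NULL;
    cudaError_t err = cudaMalloc(&d_new, new_bytes);
    if (err != cudaSuccess) {
        return err;
    }
    /* Device-to-device copy of the existing contents into the larger buffer. */
    err = cudaMemcpy(d_new, *d_buf, old_bytes, cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess) {
        cudaFree(d_new);
        return err;
    }
    cudaFree(*d_buf);
    *d_buf = d_new;    /* caller's pointer now refers to the larger allocation */
    return cudaSuccess;
}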
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host-mapped data is never duplicated in device memory; apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCIe bus.
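For reference, a minimal sketch of the mapped pinned memory being discussed: the host buffer is allocated with cudaHostAlloc(..., cudaHostAllocMapped) and the device-side alias is obtained with cudaHostGetDevicePointer(); every kernel access through that alias is serviced over the bus, which is the hazard described above.

/* Sketch: allocating mapped (zero-copy) pinned host memory. Kernel accesses
 * through 'd_ptr' cross the PCIe bus instead of reading device memory. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 1 << 20;
    float *h_ptr = NULL, *d_ptr = NULL;

    /* Pinned host allocation that is mapped into the device address space. */
    cudaHostAlloc((void **)&h_ptr, bytes, cudaHostAllocMapped);
    /* Device-side pointer aliasing the same host memory (flags must be 0). */
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);

    /* 'd_ptr' can now be passed to a kernel; since every access crosses the
     * bus, it suits data that is touched once or only occasionally. */
    printf("host %p is visible to the device as %p\n", (void *)h_ptr, (void *)d_ptr);

    cudaFreeHost(h_ptr);
    return 0;
}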

What interrupt controller does a guest OS use, and how can I find out?

My host system is Linux, and I am using QEMU as an emulator. I want to know which interrupt controller is used by the guest OS, along with other details such as which interrupts are raised.
Please guide me through the details.
Linux can use the 8259 PIC, the simplest interrupt controller, which has two cascaded banks of 8 pins. Then there is the APIC, with 256 interrupt vectors that can be assigned to devices. You read a device's PCI configuration space to find out which interrupt line it has been assigned for the PIC, and you can tell a device to use a given vector for the APIC. Then there are MSI and MSI-X, which signal interrupts with memory-write messages rather than dedicated pins, for potentially higher performance. hth
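As a small, hedged illustration of reading a device's PCI configuration space to see which legacy interrupt it was assigned (run inside a Linux guest; the device address below is just an example): the Interrupt Line and Interrupt Pin registers live at offsets 0x3C and 0x3D of the standard configuration header.

/* Sketch: reading the legacy Interrupt Line/Pin registers of a PCI device
 * from inside a Linux guest via sysfs. The BDF address is only an example. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t cfg[64];   /* first 64 bytes: the standard PCI configuration header */
    FILE *f = fopen("/sys/bus/pci/devices/0000:00:03.0/config", "rb");

    if (!f || fread(cfg, 1, sizeof(cfg), f) != sizeof(cfg)) {
        perror("read PCI config space");
        return 1;
    }
    fclose(f);

    printf("Interrupt Line (legacy IRQ number): %u\n", cfg[0x3C]);
    printf("Interrupt Pin  (1=INTA .. 4=INTD) : %u\n", cfg[0x3D]);
    return 0;
}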

Can I access device global memory from a host?

The CUDA C Programming Guide states that on devices of compute capability 2.0 and above, the host and device share a unified address space on 64-bit Linux. I have a piece of global memory allocated via the standard runtime API cudaMalloc, but it seems the host cannot directly access it. Should I do something special to make it accessible to the host?
Device memory allocated statically or dynamically is not directly accessible (e.g. by dereferencing a pointer) from the host. It is necessary to access it via a CUDA runtime API call such as cudaMemset or cudaMemcpy. The fact that host and device share the same address space (UVA) does not mean they can be accessed the same way. It simply means that the address ranges do not overlap: if I have a device pointer that was allocated at a particular location such as 0x00F0000 in the logical address space, I should not expect to find a host pointer at the same location. Therefore, given suitable record-keeping, I can inspect the numerical value of a pointer and immediately determine whether it is a host or device pointer.
In the programming guide, it states:
Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime (described in Programming Interface). This includes device memory allocation and deallocation as well as data transfer between host and device memory.
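A hedged sketch of the point made above: the cudaMalloc'ed pointer cannot be dereferenced on the host and must be reached through runtime API calls; cudaPointerGetAttributes() is shown here as an API-level alternative to manual record-keeping for telling host and device pointers apart under UVA.

/* Sketch: a cudaMalloc'ed pointer must be accessed via the runtime API from
 * host code; dereferencing it directly on the host is invalid. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int *d_val = NULL;
    int h_val = -1;

    cudaMalloc((void **)&d_val, sizeof(int));
    cudaMemset(d_val, 0, sizeof(int));

    /* *d_val = 42;  <-- would fault: host dereference of a device pointer */
    cudaMemcpy(&h_val, d_val, sizeof(int), cudaMemcpyDeviceToHost);   /* correct */

    /* Under UVA the runtime can also classify a pointer for us. */
    struct cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_val);
    printf("value = %d, pointer refers to %s memory\n", h_val,
           attr.type == cudaMemoryTypeDevice ? "device" : "host/other");

    cudaFree(d_val);
    return 0;
}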