Trying to learn the KVM and VCPUs structs - qemu

Please bear with me as I'm trying to learn the particulars of the KVM and VCPUs structs in hopes to write some code to utilize them. I've been looking through the code to try and figure things out but I'm still a little unclear. Regarding the KVM struct, when a VM is created is there a KVM struct for that KVM? So is there a 1:1 relationship there? Looking through the struct and code that looks to be the case to me. Regarding the VPCU struct, I noticed there is a pointer to the KVM struct. Is this just a recursive pointer back to the KVM struct that the VCPU struct is in? Thanks in advance!

Yes.
There is one kvm struct per virtual machine. This struct contains an array called vcpus, the elements of which are pointers to the vcpus associated with the 'machine'.
If you wanted to know that, you may also be interested in the fact that the vm_list member of the kvm struct can be used to iterate through all of the kvm-based virtual machines on the host.

Related

QEMU MIPS32 - 16550 Uart Implementation on a Custom Board

I’m trying to use QEMU to emulate a piece of firmware, but I’m having trouble getting the UART device to properly update the Line Status Register and display the input character.
Details:
Target device: Qualcomm QCA9533 (Documentation here if you're curious)
Target firmware: VxWorks 6.6 with U-Boot bootload
CPU: MIPS 24Kc
Board: mipssim (modified)
Memory: 512MB
Command used: qemu-system-mips -S -s -cpu 24Kc -M mipssim –nographic -device loader,addr=0xBF000000,cpu-num=0 -serial /dev/ttyS0 -bios target_image.bin
I have to apologize here, but I am unable to share my source. However, as I am attempting to retool the mipssim board, I have only made minor changes to the code, which are as follows:
Rebased bios memory region to 0x1F000000
Changed load_image_targphys() target address to 0x1F000000
Changed $pc initial value to 0xBF000000 (TLB remap of 0x1F000000)
Replaced the mipssim serial_init() ¬call with serial_mm_init(isa, 0x20000, env->irq[0], 115200, serial_hd(0), DEVICE_NATIVE_ENDIAN).
While it seems like serial_init() is probably the currently accepted standard, I wasn’t having any luck with remapping it. I noticed the malta board had no issues outputting on a MIPS test kernel I gave it, so I tried to mimic what was done there. However, I still cannot understand how QEMU works and I am unable to find many good resources that explain it. My slog through the source and included docs is ongoing, but in the meantime I was hoping someone might have some insight into what I’m doing wrong.
The binary loads and executes correctly from address 0xBF000000, but hangs when it hits the first UART polling loop. A look at mtree in the QEMU monitor shows that the I/O device is mapped correctly to address range 0x18020000-0x1802003F, and when the firmware writes to the Tx buffer, gdb shows the character successfully is written to memory. There’s just no further action from the serial device to pull that character and display it, so the firmware endlessly polls on the LSR waiting for an update.
Is there something I’m missing when it comes to serial/hardware interaction in QEMU? I would have assumed that remapping all of the existing functional components of the mipssim board would be enough to at least get serial communication working, especially since the target uses the same 16550 UART that mipssim does. Please let me know if you have any insights. It would be helpful if I could find a way to debug QEMU itself with symbols, but at the same time I’m not totally sure what I’d be looking for. Even advice on how to scope down the issue would be useful.
Thank you!
Well after a lot of hard work I got the UART working. The answer to the question lies within the serial_ioport_read() and serial_ioport_write() functions. These two methods are assigned as the callbacks QEMU invokes when data is read or written to the MemoryRegion for the serial device (which is initialized in serial_init() or serial_mm_init()). These functions do a bit of masking on the address (passed into the functions as addr) to determine which register is being referenced, then return the value from the SerialState struct corresponding to that register. It's surprisingly simple, but I guess everything seems simple once you've figured it out. The big turning point was the realization that QEMU effectively implements the serial device as a MemoryRegion with special functionality that is triggered on a memory operation.
Anyway, hope this helps someone in the future avoid the nightmare I went through. Cheers!

CUDA: calling a __host__ function() from a __global__ function() is not allowed [duplicate]

As the following error implies, calling a host function ('rand') is not allowed in kernel, and I wonder whether there is a solution for it if I do need to do that.
error: calling a host function("rand") from a __device__/__global__ function("xS_v1_cuda") is not allowed
Unfortunately you can not call functions in device that are not specified with __device__ modifier. If you need in random numbers in device code look at cuda random generator curand http://developer.nvidia.com/curand
If you have your own host function that you want to call from a kernel use both the __host__ and __device__ modifiers on it:
__host__ __device__ int add( int a, int b )
{
return a + b;
}
When this file is compiled by the NVCC compiler driver, two versions of the functions are compiled: one callable by host code and another callable by device code. And this is why this function can now be called both by host and device code.
The short answer is that here is no solution to that issue.
Everything that normally runs on a CPU must be tailored for a CUDA environment without any guarantees that it is even possible to do. Host functions are just another name in CUDA for ordinary C functions. That is, functions running on a CPU-memory Von Neumann architecture like all C/C++ has been up to this point in PCs. GPUs give you tremendous amounts of computing power but the cost is that it is not nearly as flexible or compatible. Most importantly, the functions run without the ability to access main memory and the memory they can access is limited.
If what you are trying to get is a random number generator you are in luck considering that Nvidia went to the trouble of specifically implementing a highly efficient Mersenne Twister that can support up to 256 threads per SMP. It is callable inside a device function, described in an earlier post of mine here. If anyone finds a better link describing this functionality please remove mine and replace the appropriate text here along with the link.
One thing I am continually surprised by is how many programmers seem unaware of how standardized high quality pseudo-random number generators are. "Rolling your own" is really not a good idea considering how much of an art pseudo-random numbers are. Verifying a generator as providing acceptably unpredictable numbers takes a lot of work and academic talent...
While not applicable to 'rand()' but a few host functions like "printf" are available when compiling with compute compatibility >= 2.0
e.g:
nvcc.exe -gencode=arch=compute_10,code=\sm_10,compute_10\...
error : calling a host function("printf") from a __device__/__global__ function("myKernel") is not allowed
Compiles and works with sm_20,compute_20
I have to disagree with some of the other answers in the following sense:
OP does not describe a problem: it is not unfortunate that you cannot call __host__ functions from device code - it is entirely impossible for it to be any other way, and that's not a bad thing.
To explain: Think of the host (CPU) code like a CD which you put into a CD player; and on the device code like a, say, SD card which you put into a a miniature music player. OP's question is "how can I shove a disc into my miniature music player"? You can't, and it makes no sense to want to. It might be the same music essentially (code with the same functionality; although usually, host code and device code don't perform quite the same computational task) - but the media are not interchangeable.

What steps are necessary to pipeline a processor in VHDL?

This is a homework question, obviously. I'm trying to pipeline a simple, 5 stage (IF,ID,EX,MEM,WB), single-cycle MIPS processor in VHDL. I don't need to implement forwarding or hazard detection for it. I'm just unsure of what components I need to implement.
Is it necessary to create D Flip-Flops for each signal?
The pipeline implementation here uses a for-loop for the outputs - is that something I should do?
Any tips would be much appreciated, I can't seem to find much relevant information on pipelining in VHDL.
What you probably want to do is create a separate entity for each stage of your pipeline and then connect the output of one stage to the input of the other.
To make sure things are pipelined correctly, you just need to make sure that each stage only does whatever processing it needs to do on the rising edge.
If you want an example, take a look at this project of mine. Specifically at the files dft_top.vhd and dft_stage[1-3].vhd. It implements a 16-point 16-bit fixed point DFT in pipelined stages.

How to call a host function in a CUDA kernel?

As the following error implies, calling a host function ('rand') is not allowed in kernel, and I wonder whether there is a solution for it if I do need to do that.
error: calling a host function("rand") from a __device__/__global__ function("xS_v1_cuda") is not allowed
Unfortunately you can not call functions in device that are not specified with __device__ modifier. If you need in random numbers in device code look at cuda random generator curand http://developer.nvidia.com/curand
If you have your own host function that you want to call from a kernel use both the __host__ and __device__ modifiers on it:
__host__ __device__ int add( int a, int b )
{
return a + b;
}
When this file is compiled by the NVCC compiler driver, two versions of the functions are compiled: one callable by host code and another callable by device code. And this is why this function can now be called both by host and device code.
The short answer is that here is no solution to that issue.
Everything that normally runs on a CPU must be tailored for a CUDA environment without any guarantees that it is even possible to do. Host functions are just another name in CUDA for ordinary C functions. That is, functions running on a CPU-memory Von Neumann architecture like all C/C++ has been up to this point in PCs. GPUs give you tremendous amounts of computing power but the cost is that it is not nearly as flexible or compatible. Most importantly, the functions run without the ability to access main memory and the memory they can access is limited.
If what you are trying to get is a random number generator you are in luck considering that Nvidia went to the trouble of specifically implementing a highly efficient Mersenne Twister that can support up to 256 threads per SMP. It is callable inside a device function, described in an earlier post of mine here. If anyone finds a better link describing this functionality please remove mine and replace the appropriate text here along with the link.
One thing I am continually surprised by is how many programmers seem unaware of how standardized high quality pseudo-random number generators are. "Rolling your own" is really not a good idea considering how much of an art pseudo-random numbers are. Verifying a generator as providing acceptably unpredictable numbers takes a lot of work and academic talent...
While not applicable to 'rand()' but a few host functions like "printf" are available when compiling with compute compatibility >= 2.0
e.g:
nvcc.exe -gencode=arch=compute_10,code=\sm_10,compute_10\...
error : calling a host function("printf") from a __device__/__global__ function("myKernel") is not allowed
Compiles and works with sm_20,compute_20
I have to disagree with some of the other answers in the following sense:
OP does not describe a problem: it is not unfortunate that you cannot call __host__ functions from device code - it is entirely impossible for it to be any other way, and that's not a bad thing.
To explain: Think of the host (CPU) code like a CD which you put into a CD player; and on the device code like a, say, SD card which you put into a a miniature music player. OP's question is "how can I shove a disc into my miniature music player"? You can't, and it makes no sense to want to. It might be the same music essentially (code with the same functionality; although usually, host code and device code don't perform quite the same computational task) - but the media are not interchangeable.

cuda trace emulation -Need some expert insight

I am working on a gpu trace emulation tool in windows as part of my research work in grad school . I am working on cuda runtime trace emulation to be specific.
I use simple DLL injection using MS Detours to enable interception of the cuda runtime APIs. I store the API calls and their parameters in a trace file. I get into some problems while trying to emulate the API from my trace file(I use the word playback to denote this action)
A typical trace file begins by making calls to __cudaRegisterFatBinary and __cudaRegisterFunction. This is followed by a call to cudaMalloc.
What I did?
1) I came across the famous GPUOcelot and I found the cubin structure that Nvidia is using right now. I am using that to save the address parameter of cudaRegisterFatBinary in intercept mode and I am using the pointer in the playback for _cudaRegisterFatBinary by repopulating the structure in the memory.
2)In _cudaRegisterFunction I am not sure what the parameters hostFunction,Device Function and Device Name refer to. I mean I don't understand how I could populate it while playing back from my trace file. I am just saving the pointer from the original execution and using it to imitate the call. But there is no way of knowing whether the function goes through fine since it does not have a return value.
3)cudaMalloc following these two entry point functions return cuda error code 11. It is cuda invalid value according to the Nvidia documentation. I have no idea why this should be the case. I am assuming that something is wrong with the previous two function calls. I also have a feeling that something is wrong with implicit primary context creation by the cuda runtime. Can someone give me some insights about cuda runtime execution and point me to what might I be missing?
I know its a ton of information without any useful code. I dont know which part of the code to post here. I will do it when people start taking interest in my question and ask me specific things about my project. Initially am just hoping that I am missing something big and high level that one of you can spot.
I greatly appreciate your time and interest!
Sounds very interesting overall. Your "Error:Cuda invalid value" is could be related to the params of _cudaRegisterFunction. The param 'DeviceName' sounds like it identifies which GPU (card?) to use. Check the CUDA SDK, there are many demos that enumerate the GPUs on the system, perhaps these values are valid for 'DeviceName'. As for 'hostFunction' and 'deviceFunction', these sound like either function IDs, or perhaps function pointers. Also, you can call 'cudaGetLastError()' to test whether the function call was successful (it returns 'cudaSuccess' if everything is ok... take a look at the error logging macros in the sdk). Good luck!