How to count the number of guest instructions QEMU executed from the beginning to the end of a run? - qemu

I want to benchmark guest instructions per second of QEMU to compare it with other simulators.
How to obtain the guest instruction count? I'm interested both in user and full system mode.
The only solutions I have now would be to log all instructions with either simple trace exec_tb or -d in_asm: How to use QEMU's simple trace backend? and then count the instructions from there. But this would likely considerably reduce simulation performance due to the output operations, so I would likely have to run the test program twice, one with and another without the trace, and hope that both executions are similar (should be, especially for single threaded user mode simulation).
I saw the -icount option, which sounds promising from the name, but when I passed it to QEMU 4.0.0, I didn't see anything happen. Should it print an instruction count somewhere? The following patch appears unmerged and suggests not: https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg01275.html

Basic Profiling
To follow up on Peter's answer, I have recently run into a situation where I wanted to get the instruction count of a program run under QEMU (I'm using v4.2.0, the first where plugins became available).
One of the example plugins, insn.c, does exactly what you want, and returns the count of executed instructions on plugin exit.
(I assume you already know how to run QEMU, so I'll strip this down to the important flags)
qemu-system-arm ... -plugin qemu-install-dir/build/tests/plugin/libinsn.so,arg=inline -d plugin
The first part loads the plugin and passes a single argument, "inline" to it. The next part enables printing of the plugin. You can redirect the plugin output to a different file by adding -D filename to the command line invocation.
More Advanced Profiling
When I was looking for possible ways to profile a program run under QEMU, this is one of the only results of my search that was promising. In the spirit of creating a good record for other searching in the future, here are some links to code that I have written to do just that.
Profiling Plugin code, docs.
Disclaimer: I wrote the above code.

Current released versions of QEMU don't provide any means for doing this. The upcoming "TCG plugin" support which should go out in the 4.2 release at the end of the year would allow you to write a simple "count the instructions executed" plugin, but this (as with the -d tracing) will add an overhead.
The -icount option is certainly confusing, but what it does is make the emulated CPU (try to) run at a specific number of executed instructions per second, as opposed to the default of "as fast as possible". This has higher overhead (and it will stop QEMU using multiple host threads for SMP guests), but is more deterministic.
Philosophically speaking, "instructions per second" is a rather misleading metric for emulators, because the time taken to execute an instruction can vary vastly compared to hardware. Loads and stores are rather slower than on real hardware. Floating point instructions are incredibly slow (perhaps a factor of 10 or worse of an integer arithmetic instruction, where real hardware could execute both in one cycle). JIT emulators like QEMU have a start-stop performance profile where execution stops entirely while we translate a block of code, whereas a real CPU or an interpreting emulator will not have these pauses. How much effect the JIT time has will depend on whether your code reruns previously translated hot code frequently or if it spends most of its time running "new" code, and whether it does things that result in the JIT having to discard the old code (eg self modifying code, or frequent between-process context switches). If you had an "IPS meter" on your emulator you'd see the value it reported fluctuate wildly as the guest code executed and did different things. You're probably better off just picking a benchmark which you think is representative of your actual use case, running it on various emulators, and comparing the wall-clock time it takes to complete.

Related

QEMU MIPS32 - 16550 Uart Implementation on a Custom Board

I’m trying to use QEMU to emulate a piece of firmware, but I’m having trouble getting the UART device to properly update the Line Status Register and display the input character.
Details:
Target device: Qualcomm QCA9533 (Documentation here if you're curious)
Target firmware: VxWorks 6.6 with U-Boot bootload
CPU: MIPS 24Kc
Board: mipssim (modified)
Memory: 512MB
Command used: qemu-system-mips -S -s -cpu 24Kc -M mipssim –nographic -device loader,addr=0xBF000000,cpu-num=0 -serial /dev/ttyS0 -bios target_image.bin
I have to apologize here, but I am unable to share my source. However, as I am attempting to retool the mipssim board, I have only made minor changes to the code, which are as follows:
Rebased bios memory region to 0x1F000000
Changed load_image_targphys() target address to 0x1F000000
Changed $pc initial value to 0xBF000000 (TLB remap of 0x1F000000)
Replaced the mipssim serial_init() ¬call with serial_mm_init(isa, 0x20000, env->irq[0], 115200, serial_hd(0), DEVICE_NATIVE_ENDIAN).
While it seems like serial_init() is probably the currently accepted standard, I wasn’t having any luck with remapping it. I noticed the malta board had no issues outputting on a MIPS test kernel I gave it, so I tried to mimic what was done there. However, I still cannot understand how QEMU works and I am unable to find many good resources that explain it. My slog through the source and included docs is ongoing, but in the meantime I was hoping someone might have some insight into what I’m doing wrong.
The binary loads and executes correctly from address 0xBF000000, but hangs when it hits the first UART polling loop. A look at mtree in the QEMU monitor shows that the I/O device is mapped correctly to address range 0x18020000-0x1802003F, and when the firmware writes to the Tx buffer, gdb shows the character successfully is written to memory. There’s just no further action from the serial device to pull that character and display it, so the firmware endlessly polls on the LSR waiting for an update.
Is there something I’m missing when it comes to serial/hardware interaction in QEMU? I would have assumed that remapping all of the existing functional components of the mipssim board would be enough to at least get serial communication working, especially since the target uses the same 16550 UART that mipssim does. Please let me know if you have any insights. It would be helpful if I could find a way to debug QEMU itself with symbols, but at the same time I’m not totally sure what I’d be looking for. Even advice on how to scope down the issue would be useful.
Thank you!
Well after a lot of hard work I got the UART working. The answer to the question lies within the serial_ioport_read() and serial_ioport_write() functions. These two methods are assigned as the callbacks QEMU invokes when data is read or written to the MemoryRegion for the serial device (which is initialized in serial_init() or serial_mm_init()). These functions do a bit of masking on the address (passed into the functions as addr) to determine which register is being referenced, then return the value from the SerialState struct corresponding to that register. It's surprisingly simple, but I guess everything seems simple once you've figured it out. The big turning point was the realization that QEMU effectively implements the serial device as a MemoryRegion with special functionality that is triggered on a memory operation.
Anyway, hope this helps someone in the future avoid the nightmare I went through. Cheers!

QEMU/QMP alert when writing to memory

I'm using QEMU to test some software for a personal project and I would like to know whenever the program is writing to memory. The best solution I have come up with is to manually add print statements in the file responsible for writing to memory. Which this would require remaking the object for the file and building QEMU, if I'm correct. But I came across QMP which uses JSON commands to manipulate QEMU, which has an entire list of commands, found here: https://raw.githubusercontent.com/Xilinx/qemu/master/qmp-commands.hx.
But after looking at that I didn't really see anything that would do what I want. I am sort of a new programmer and am not that advanced. And was wondering if anyone had some idea how to go about this a better way.
Recently (9 jun 2016) there were added powerful tracing features to mainline QEMU.
Please see qemu/docs/tracing.txt file as manual.
There are a lot of events that could be traced, see
qemu/trace_events file for list of them.
As i can understand the code, the "guest_mem_before" event is that you need to view guest memory writes.
Details:
There are tracing hooks placed at following functions:
qemu/tcg/tcg-op.c: tcg_gen_qemu_st * All guest stores instructions tcg-generation
qemu/include/exec/cpu_ldst_template.h all non-tcg memory access (fetch/translation time, helpers, devices)
There historically hasn't been any support in QEMU for tracing all guest memory accesses, because there isn't any one place in QEMU where you could easily add print statements to trace them. This is because more guest memory accesses go through the "fast path", where we directly generate native host instructions which look up the host RAM address in a data structure (QEMU's TLB) and perform the load or store. It's only if this fast path doesn't find a hit in the TLB that we fall back to a slow path that's written in C.
The recent trace-events event 'tcg guest_mem_before' can be used to trace virtual memory accesses, but note that it won't tell you:
whether the access succeeded or faulted
what the data being loaded or stored was
the physical address that's accessed
You'll also need to rebuild QEMU to enable it (unlike most trace events which are compiled into QEMU by default and can be enabled at runtime.)

Is there any difference in the output of nvvp (visual) and nvprof (command line)?

To measure metrics/events for CUDA programs, I have tried using the command line like:
nvprof --metrics <<metric_name>>
I also measured the same metrics on the Visual profiler nvvp. I noticed no difference in the values I get.
I noticed a difference in output when I choose a metric like achieved_occupancy. But this varies with every execution and that's probably why I get different results each time I run it, irrespective of whether I am using nvvp or nvprof.
The question:
I was under the impression that nvvp and nvprof are exactly the same, and that nvvp is simply a GUI built on top of nvprof for ease of use. However I have been given this advice:
Always use the visual profiler. Never use the command line.
Also, this question says:
I do not want to use the command line profiler as I need the global load/store efficiency, replay and DRAM utilization, which are much more visible in the visual profiler.
Apart from 'dynamic' metrics like achieved_occupancy, I never noticed any differences in results. So, is this advice valid? Is there some sort of deficiency in the way nvprof works? I would like to know the advantages of using the visual profiler over the command line form, if there are any.
More specifically, are there metrics for which nvprof gives wrong results?
Note:
My question is not the same as this or this because these are asking about the difference between nvvp and Nsight.
I'm not sure why someone would give you the advice:
Never use the command line.
assuming by "command line" you do in fact mean nvprof. That's not sensible. There are situations where it makes sense to use nvprof. (Note that if you actually meant the command line profiler, then that advice might be somewhat sensible, although still a matter of preference. It is separate from nvprof so has a separate learning curve. I personally would use nvprof instead of the command line profiler.)
nvvp uses nvprof under the hood, in order to do all of its measurement work. However nvvp may combined measured metrics in various interesting ways, e.g. to facilitate guided analysis.
nvprof should not give you "wrong results", and if it did for some reason, then nvvp should be equally susceptible to such errors.
Use of nvvp vs. nvprof may be simply a matter of taste or preference.
Many folks will like the convenenience of the GUI. And the nvvp GUI offers a "Guided Analysis" mode which nvprof does not. I'm sure there could be created an exhaustive list of other differences if you go through the documentation. But whatever nvvp does, it does it using nvprof. It doesn't have an alternate method to query the device for profiler data -- it uses nvprof.
I would use nvprof when it's inconvenient to use nvvp, perhaps when I am running on a compute cluster node where it's difficult or impossible to launch nvvp. You might also use it if you are doing targetted profiling (measuring a single metric, e.g. shared_replay_overhead - nvprofis certainly quicker than firing up the GUI and running a session), or if you are collecting metrics for tabular generation over a large series of runs.
In most other cases, I personally would use nvvp. The timeline feature itself is hugely more convenient than trying to assemble a sequence in your head from the output of nvprof --print-gpu-trace ... which is essentially the same info as the timeline.

qemu performance same with and without multi-threading and inconsistent behaviour

I am new to qemu simulator.I want to emulate our existing pure c h264(video decoder)code in arm platform(cortex-a9) using qemu in ubuntu 12.04 and I had done it successfully from the links available in the internet.
Also we are having multithreading(pthreads) code in our application to speed up the process.If we enable multithreading we are getting the same performance (i.e)single thread(without multithreading).
Eg. single thread 9.75sec
Multithread 9.76sec
Since qemu will support parallel processing we are not able to get the performance.
steps done are as follows
1.compile the code using arm-linux-gnueabi-toolchain
2.Execute the code
qemu-arm -L executable
3.qemu version 1.6.1
Is there any option or settings has to be done in qemu if we want measure the performance in multi threading because we want to get the difference between single thread and multithread using qemu since we are not having any arm board with us.
Moreover,multithreading application hangs if we run for third time or fourth time i.e inconsistent behaviour in qemu.
whether we can rely on this qemu simulator or not since it is not cycle accurate.
You will not be able to use QEMU to estimate real hardware speed.
Also QEMU currently supports SMP running in a single thread... this means your guest OS will see multiple CPUs but will not recieve adicional cycles since all the emulation is occuring in a single thread.
Note that IO is delegated to separate threads... so usually if your VM is doing cpu and IO work you will see at least 1.5+ cores on the host being used.
There has been alot of research into parallelizing the cpu emulation in qemu but without much sucess. I suggest you buy some real hardware and run it there especially consiering that coretex-a9 hardware is cheap these days.

CUDA apps time out & fail after several seconds - how to work around this?

I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit out. I realize it's ideal to not have CUDA application run that long but assuming that it is the correct choice to use CUDA and due to the amount of sequential work per thread it must run that long, is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert, --- I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious.
To disable it, you need to regedit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a
higher amount, so that Windows will allow for a longer delay before
TDR process starts.
Open Regedit from Run or DOS.
In Windows 7 navigate to the correct registry key area, to create the
new key:
HKEY_LOCAL_MACHINE>SYSTEM>CurrentControlSet>Control>GraphicsDrivers.
There will probably one key in there called DxgKrnlVersion there as a
DWord.
Right click and select to create a new key REG_DWORD, and name it
TdrDelay. The value assigned to it is the number of seconds before
TDR kicks in - it > is currently 2 automatically in Windows (even
though the reg. key value doesn't exist >until you create it). Assign
it with a new value (I tried 4 seconds), which doubles the time before
TDR. Then restart PC. You need to restart the PC before the value will
work.
Source from Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and works fine.
The most basic solution is to pick a point in the calculation some percentage of the way through that I am sure the GPU I am working with is able to complete in time, save all the state information and stop, then to start again.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you screen to a non-X full-screen terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from CPU (including option to abort), without the costly device<-->host memory transfers between iterations.
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM, it is possible to modify the settings (timeout, behaviour on reaching timeout etc.) with some registry keys, see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X-configuration (likely /etc/X11/xorg.conf)
Adding: Option "Interactive" "0" to the device section of your GPU does the job.
see CUDA Visual Profiler 'Interactive' X config option?
For details on the config
and
see ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive
For a description of the parameter.