How to improve the speed of Qemu TCG - qemu

The user-mode TCG of qemu executes the MIPS program, or debootstrap is very slow to create the MIPS basic image. Is there any way to optimize the efficiency of TCG translation binary?
It takes time to execute debootstrap on the MIPS machine to build the basic image:144 second
It takes time to execute debootstrap on the x86 machine to build the MIPS64el basic image:714 second

As an end-user there is not much you can do: emulation is just a slow process, and there aren't any tuning knobs on QEMU to change aspects of it. If you're using an older QEMU you could try upgrading to a newer one -- there have been some performance improvements over time. (Don't expect miracles, though.)

Related

Profile debug or release cuda code?

I have been profiling an application with nvprof and nvvp (5.5) in order to optimize it. However, I get totally different results for some metrics/events like inst_replay_overhead, ipc or branch_efficiency, etc. when I'm profiling the debug (-G) and release version of the code.
so my question is: which version should I profile? The release or debug version? Or the choice depends upon what I'm looking for?
I found CUDA - Visual Profiler and Control Flow Divergence where is stated that a debug (-G) version is needed to properly measure the divergent branches metric, but I am not sure about other metrics.
Profiling usually implies that you care about performance.
If you care about performance, you should profile the release version of a CUDA code.
The debug version (-G) will generate different code, which usually runs slower. There's little point in doing performance analysis (including execution time measurement, benchmarking, profiling, etc.) on a debug version of a CUDA code, in my opinion, for this reason.
The -G switch turns off most optimizations that the device code compiler might ordinarily make, which has a large effect on code generation and also often a large effect on performance. The reason for the disabling of optimizations is to facilitate debug of code, which is the primary reason for the -G switch and for a debug version of your code.

qemu performance same with and without multi-threading and inconsistent behaviour

I am new to qemu simulator.I want to emulate our existing pure c h264(video decoder)code in arm platform(cortex-a9) using qemu in ubuntu 12.04 and I had done it successfully from the links available in the internet.
Also we are having multithreading(pthreads) code in our application to speed up the process.If we enable multithreading we are getting the same performance (i.e)single thread(without multithreading).
Eg. single thread 9.75sec
Multithread 9.76sec
Since qemu will support parallel processing we are not able to get the performance.
steps done are as follows
1.compile the code using arm-linux-gnueabi-toolchain
2.Execute the code
qemu-arm -L executable
3.qemu version 1.6.1
Is there any option or settings has to be done in qemu if we want measure the performance in multi threading because we want to get the difference between single thread and multithread using qemu since we are not having any arm board with us.
Moreover,multithreading application hangs if we run for third time or fourth time i.e inconsistent behaviour in qemu.
whether we can rely on this qemu simulator or not since it is not cycle accurate.
You will not be able to use QEMU to estimate real hardware speed.
Also QEMU currently supports SMP running in a single thread... this means your guest OS will see multiple CPUs but will not recieve adicional cycles since all the emulation is occuring in a single thread.
Note that IO is delegated to separate threads... so usually if your VM is doing cpu and IO work you will see at least 1.5+ cores on the host being used.
There has been alot of research into parallelizing the cpu emulation in qemu but without much sucess. I suggest you buy some real hardware and run it there especially consiering that coretex-a9 hardware is cheap these days.

Why is the initialization of the GPU taking very long on Kepler achitecture and how to fix this?

When running my application the very first cuda_malloc takes 40 seconds which is due to the initialization of the GPU. When I build in debug mode this reduces to 5 seconds and when I run the same code on a Fermi device, it takes far less than a second (not even worth measuring in my case).
Now the funny thing is that if I compile for this specific architecture, using the flag sm35 instead of sm20, it becomes fast again. As I should not use any new sm35 features just yet, how can I compile for sm20 and not have this huge delay? Also I am curious what is causing this delay? Is the machine code recompiled on the fly into sm35 code?
Ps. I run on windows but a colleague of mine encountered the same problem, probably on windows. The device is a Kepler, driver version 320.
Yes, the machine code is recompiled on the fly. This is called the JIT-compile step, and it will occur any time the machine code does not match the device that is being used (and assuming valid PTX code exists in the executable.)
You can learn more about JIT-compile here. Note the discussion of the cache which should alleviate the issue after the first run.
If you specify compilation for both sm_20 and sm_35, you can build a binary/executable that will run quickly on both types of devices, and you will also get notification if you are using a sm_35 feature that is not supported on sm_20 (during the compile process).

Run my application in a simulated low memory, slow CPU environment

I want to stress-test my application this way, because it seems to be failing in some very old client machines.
At first I read a bit about QEmu and thought about hardware emulation, but it seems a long shot. I asked at superuser, but didn't get much feedback (yet).
So I'm turning to you guys... How do you this kind of testing?
I'm not sure about slowing a CPU but if you use a virtual machine, like VMWare, you can control how much RAM is actually used. I run it on a MBP at home with 8GB and my WinXP VM is capped at 1.5 GB RAM.
EDIT: I just checked my version of VMWare and I can control the number of cores it can use. It's definitely not the same as a slower CPU but it might highlight some issues for you.
Since it's not entirely clear that your app is failing because of the old hardware or the old OS, a VM client should allow you to test various versions of OSes rather quickly. It came in handy for me a few years back when I was trying to get a .Net 2.0 app to run on Win98 (it can be done though I don't remember how I got it working...).
Virtual Box is a free virtual machine similar to VMWare. It also has the capacity to reduce available memory. It can restrict how many CPUs are available, but not how fast those CPUs are.
Try cpulimit, most distro includes it (Ubuntu does) http://www.digipedia.pl/man/doc/view/cpulimit.1
If you want to lower the speed of your cpu you can easily do this by modifying a fork bomb program
int main(){
int x=0;
int limit = 10
while( x < limit ){
int pid = fork();
if( pid == 0 )
while( 1 ){}
else
x++;
}
}
This will slow down your computer quite quickly, you may want to change the limit variable to a higher number. I must warn you though this can be dangerous, because if implemented wrong you could fork bomb your system leaving it useless unless you restart it. Read this first if you don't understand what this code will do.
On POSIX (Unix) systems you can apply run limits to processes (that is, to executions of a program). The system call to do this is called setrlimit(), and most shells enable you to use the ulimit built-in to set them from the command-line (plain POSIX ulimit is not very useful). Using these you can run a program with low limits to simulate a smaller computer.
POSIX systems also provide a nice command for running a program at lower CPU priority, which can simulate a slower CPU if you also ensure there is another CPU intensive progam running at the same time.
I think it's pretty unlikely that cpu speed is going to exercise very many bugs; On the other hand, it's much more likely for different cpu features to matter. Many VM implementations provide ways of toggling on and off certain cpu features; qemu in particular permits a high level of control over what's available to the CPU.
Think outside the box. Which application of the ones you use regularly does this?
A debugger of course! But, how can you achieve such a behavior, to emulate a low cpu?
The secret to your question is asm _int 3. This is the assembly "pause me" command that is send from the attached debugger to the application you are debugging.
More about int 3 to this question.
You can use the code from this tool to pause/resume your process continuously. You can add an interval and make that tool pause your application for that amount of time.
The emulated-cpu-speed would be: (YourCPU/Interval) -0.00001% because of the signaling and other processes running on your machine, but it should do the trick.
About the low memory emulation:
You can create a wrapper class that allocates memory for the application and replace each allocation with call to this class. You would be able to set exactly the amount of memory your application can use before it fails to allocate more memory.
Something such as: MyClass* foo = AllocWrapper(new MyClass(arguments or whatever));
Then you can have the AllocWrapper allocating/deallocating memory for you.
On Linux, you can use ulimit as Raedwald said. On Windows, you can use the SetProcessWorkingSetSize system call. But these only set a limit on a per process basis. In reality, parts of the system will start to fail in a stressed environment. I would suggest using the Sysinternals' testlimit tool to stress the entire machine.
See https://serverfault.com/questions/36309/throttle-down-cpu-speed-of-vmware-image
were it is claimed free-as-in-beer VMware vSphere Hypervisor™ (ESXi) allows you to select the virtual CPU speed on top of setting the memory size of the virtual machine.

registers vs stacks

What exactly are the advantages and disadvantages to using a register-based virtual machine versus using a stack-based virtual machine?
To me, it would seem as though a register based machine would be more straight-forward to program and more efficient. So why is it that the JVM, the CLR, and the Python VM are all stack-based?
Implemented in hardware, a register-based machine is going to be more efficient simply because there are fewer accesses to the slower RAM. In software, however, even a register based architecture will most likely have the "registers" in RAM. A stack based machine is going to be just as efficient in that case.
In addition a stack-based VM is going to make it a lot easier to write compilers. You don't have to deal with register allocation strategies. You have, essentially, an unlimited number of registers to work with.
Update: I wrote this answer assuming an interpreted VM. It may not hold true for a JIT compiled VM. I ran across this paper which seems to indicate that a JIT compiled VM may be more efficient using a register architecture.
This has already been answered, to a certain level, in the Parrot VM's FAQ and associated documents:
A Parrot Overview
The relevant text from that doc is this:
the Parrot VM will have a register architecture, rather than a stack architecture. It will also have extremely low-level operations, more similar to Java's than the medium-level ops of Perl and Python and the like.
The reasoning for this decision is primarily that by resembling the underlying hardware to some extent, it's possible to compile down Parrot bytecode to efficient native machine language.
Moreover, many programs in high-level languages consist of nested function and method calls, sometimes with lexical variables to hold intermediate results. Under non-JIT settings, a stack-based VM will be popping and then pushing the same operands many times, while a register-based VM will simply allocate the right amount of registers and operate on them, which can significantly reduce the amount of operations and CPU time.
You may also want to read this: Registers vs stacks for interpreter design
Quoting it a bit:
There is no real doubt, it's easier to generate code for a stack machine. Most freshman compiler students can do that. Generating code for a register machine is a bit tougher, unless you're treating it as a stack machine with an accumulator. (Which is doable, albeit somewhat less than ideal from a performance standpoint) Simplicity of targeting isn't that big a deal, at least not for me, in part because so few people are actually going to directly target it--I mean, come on, how many people do you know who actually try to write a compiler for something anyone would ever care about? The numbers are small. The other issue there is that many of the folks with compiler knowledge already are comfortable targeting register machines, as that's what all hardware CPUs in common use are.
Traditionally, virtual machine implementors have favored stack-based architectures over register-based due to 'simplicity of VM implementation' ease of writing a compiler back-end - most VMs are originally designed to host a single language and code density and executables for stack architecture are invariably smaller than executables for register architectures. The simplicity and code density are a cost of performance.
Studies have shown that a registered-based architecture requires an average of 47% less executed VM instructions than stack-based architecture, and the register code is 25% larger than corresponding stack code but this increase cost of fetching more VM instructions due to larger code size involves only 1.07% extra real machine loads per VM instruction which is negligible. The overall performance of the register-based VM is that it takes, on average, 32.3% less time to execute standard benchmarks.
One reason for building stack-based VMs is that that actual VM opcodes can be smaller and simpler (no need to encode/decode operands). This makes the generated code smaller, and also makes the VM code simpler.
How many registers do you need?
I'll probably need at least one more than that.
Stack based VM's are simpler and the code is much more compact. As a real world example, a friend built (about 30 years ago) a data logging system with a homebrew Forth VM on a Cosmac. The Forth VM was 30 bytes of code on a machine with 2k of ROM and 256 bytes of RAM.
It is not obvious to me that a "register-based" virtual machine would be "more straight-forward to program" or "more efficient". Perhaps you are thinking that the virtual registers would provide a short-cut during the JIT compilation phase? This would certainly not be the case, since the real processor may have more or fewer registers than the VM, and those registers may be used in different ways. (Example: values that are going to be decremented are best placed in the ECX register on x86 processors.) If the real machine has more registers than the VM, then you're wasting resources, fewer and you've gained nothing using "register-based" programming.
Stack based VMs are easier to generate code for.
Register based VMs are easier to create fast implementations for, and easier to generate highly optimized code for.
For your first attempt, I recommend starting with a stack based VM.