Is simulated PowerPC faster than actual PowerPC? - qemu

I have used a PowerPC chip emulated by QEMU, and I am currently using a Xilinx Virtex-II Pro to execute PowerPC instructions.
On both I run a custom RTOS and measure the time taken by a task. The contents of the task do not differ between the two implementations, but the time taken differs considerably.
The time taken on QEMU is around 200 microseconds, whereas time taken on Xilinx chip is about 2000 microseconds.
Why does this happen? Shouldn't running the RTOS directly on hardware be faster than emulating it?
Edit: both run at 300 MHz.

It's not inconceivable that QEMU running on some host could perform better than some particular real hardware. In vague marketing terms, QEMU's PowerPC JIT could reach the mid hundreds of MIPS on (unspecified) systems in 2010, which is comparable to a PowerPC 405 clocked in the low hundreds of MHz (Xilinx Virtex-II Pro Datasheet, PowerPC 405 Core Product Overview).
Whether the gap between QEMU on whatever host you are running it on and the PowerPC 405 core on your FPGA is large enough to explain the time difference you measured for your particular task is a different question, and is unclear without more information.
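If you want a rough apples-to-apples comparison that is independent of your RTOS task, one option is to time the same fixed amount of integer work on both systems. The sketch below is only illustrative: it assumes you can run a small C program (or fold the loop into an RTOS task) on both targets, and clock_gettime stands in for whatever timer your RTOS actually provides.

    /* Hypothetical fixed-work benchmark: time the same loop under QEMU and on
     * the PowerPC 405 target.  Assumes a POSIX-like clock; on a bare-metal
     * RTOS, substitute its own tick counter for clock_gettime(). */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;
        volatile uint32_t acc = 0;          /* volatile so the loop is not optimized away */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (uint32_t i = 0; i < 10000000u; i++)
            acc += i * 2654435761u;         /* fixed integer work, no I/O */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("acc=%u, elapsed=%.1f us\n", (unsigned)acc, us);
        return 0;
    }

Comparing the two elapsed times for identical work gives you a host-vs-FPGA speed ratio without any RTOS scheduling effects mixed in.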

Related

Does nvidia gpu work less efficiently when it is the only gpu in PC?

I want to assemble a new computer mainly for CUDA applications. When it comes to the CPU, I have to choose between AMD and Intel.
Most of AMD's processors don't have an integrated GPU, while Intel's processors do.
My question is:
If the NVIDIA GPU were the only graphics processing unit in the whole PC (no integrated one),
would its efficiency for CUDA programs be worse, since it also has to render the desktop (while using, for example, MATLAB)?
The answer is yes: efficiency would be slightly lower because the GPU also has to handle display tasks, like moving the cursor around or scrolling a page in a PDF viewer.
However, if you are aiming for a reasonably mid-to-high-end GPU, the loss of efficiency is marginal. If you have enough money, buy a separate dedicated GPU; if not, just don't bother. The loss might be 1% or less.
A bigger problem is that the display takes up GPU RAM, which (a) becomes unavailable to CUDA applications, and (b) the CUDA manual states that the display driver is allowed to evict the CUDA application from its memory at any time without warning (!).
If you ask whether that really happens (the display driver taking over the CUDA application's memory): yes, I have experienced it, the prime example being when you change the display resolution.
So definitely don't do any banking with GPUs, or you might see your accounts being randomly infused with millions :-)
That's why 'professional' CUDA cards (the Tesla variety) have no display outputs - just in case.
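As a side note on the memory point above, you can see how much device memory the display (and everything else) is holding with a minimal sketch using cudaMemGetInfo(); the exact numbers will vary with resolution and driver.

    /* Sketch: report free vs. total device memory so you can see how much the
     * display and other clients are consuming.  Compile with nvcc, links cudart. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t free_bytes = 0, total_bytes = 0;
        if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
            fprintf(stderr, "cudaMemGetInfo failed\n");
            return 1;
        }
        printf("free: %zu MiB / total: %zu MiB\n",
               free_bytes >> 20, total_bytes >> 20);
        return 0;
    }

Running it with and without a desktop session on the card makes the difference visible directly.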

nvidia-smi GPU performance measure does not make sense

I am using an NVIDIA GTX Titan X to do deep learning experiments.
I am using nvidia-smi to monitor the GPU running state, but the perf(ormance) state the tool provides does not make sense.
I have checked the nvidia-smi manual, and it says the following:
Performance State
The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
Without running any process on the GPU (idle state), the GPU performance state is P0.
However, when running some computation-heavy process, the state becomes P2.
My question is: why is my GPU in the P0 state when idle, but switches to P2 when running a heavy computation task? Shouldn't it be the opposite?
Also, is there a way to make my GPU always run in the P0 state (maximum performance)?
It is confusing.
The nvidia-smi manual is correct, however.
When a GPU or set of GPUs is idle, the process of running nvidia-smi on the machine will usually bring one of those GPUs out of the idle state. This is due to the information the tool is collecting - it needs to wake up one of the GPUs.
This wake up process will initially bring the GPU to P0 state (highest perf state), but the GPU driver will monitor that GPU, and eventually start to reduce the performance state to save power, if the GPU is idle or not particularly busy.
On the other hand, when the GPUs are active with a workload, the GPU driver will, according to its own heuristic, continuously adjust the performance state to deliver best performance while matching the performance state to the actual workload. If no thermal or power limits are reached, the perf state should reach its highest level (P0) for the most active and heaviest, continuous workloads.
Workloads that are periodically heavy, but not continuous, may see the GPU power state fluctuate around levels P0-P2. GPUs that are "throttled" due to thermal (temperature) or power issues may also see reduced P-states. This type of throttling is evident and reported separately in nvidia-smi, but this type of reporting may not be enabled for all GPU types.
If you want to see the P0 state on your GPU, the best advice I can offer is to run a short, heavy, continuous workload (something that does a large sgemm operation, for example), and then monitor the GPU during that workload. It should be possible to see P0 state in that situation.
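For example, a minimal cuBLAS sketch along those lines might look like the following; the matrix size and repeat count here are arbitrary choices, just large enough to keep the GPU continuously busy while you watch nvidia-smi in another terminal.

    /* Sketch: a sustained SGEMM workload to drive the GPU toward P0.
     * Compile with nvcc and link -lcublas.  The matrices are left
     * uninitialized on purpose: only the load matters here, not the result. */
    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 4096;                    /* arbitrary, heavy enough */
        const float alpha = 1.0f, beta = 0.0f;
        float *dA, *dB, *dC;
        cudaMalloc((void **)&dA, (size_t)n * n * sizeof(float));
        cudaMalloc((void **)&dB, (size_t)n * n * sizeof(float));
        cudaMalloc((void **)&dC, (size_t)n * n * sizeof(float));

        cublasHandle_t handle;
        cublasCreate(&handle);
        for (int i = 0; i < 200; i++)          /* repeat so the load is continuous */
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, dA, n, dB, n, &beta, dC, n);
        cudaDeviceSynchronize();

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        printf("done\n");
        return 0;
    }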
If you are using a machine learning application (e.g. Caffe) that uses the cuDNN library and you are training a large network, it should be possible to see P0 from time to time, because in this scenario cuDNN typically performs sgemm-like operations.
But for a sporadic workload, it's quite possible that the most commonly observed state would be P2.
To "force" a P0 power state always, you can try experimenting with the persistence mode and applications clocks via the nvidia-smi tool. Use nvidia-smi --help or the man page for nvidia-smi to understand the options.
Although I don't think this will typically apply to Tesla GPUs, some NVIDIA GPUs may limit themselves to a P2 power state under compute load unless application clocks are specifically set higher. Use the nvidia-smi -a command to see the current Application Clocks, the Default Application Clocks, and the Max Clocks available for your GPU. (Some GPUs, including older GPUs, may display N/A for some of these fields. That generally indicates the application clocks are not modifiable via nvidia-smi.)
If a card seems to run at the P2 state during compute load, you may be able to raise it to the P0 state by increasing the application clocks to the maximum available (i.e. Max Clocks). Use nvidia-smi --help to learn how to format the command to change the application clocks on your GPU. Modifying application clocks, or enabling modifiable application clocks, may require root/admin privilege.
It may also be desirable or necessary to set the GPU Persistence Mode. This prevents the driver from "unloading" during periods of no GPU activity, which would otherwise cause the application clocks to be reset when the driver re-loads.
This default behavior, for the affected cards in this situation, of limiting to P2 under compute load, is by design of the GPU driver.
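If you prefer to query this programmatically rather than parsing nvidia-smi output, a hedged NVML sketch (the same library nvidia-smi is built on, device index 0 assumed) to read the current application clocks and the maximum clocks is shown below. Actually raising the clocks with nvmlDeviceSetApplicationsClocks() usually requires root and is not supported on every GPU.

    /* Sketch: read current applications clocks and max clocks via NVML.
     * Link with -lnvidia-ml. */
    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        unsigned int appGr = 0, appMem = 0, maxGr = 0, maxMem = 0;
        nvmlDevice_t dev;

        if (nvmlInit() != NVML_SUCCESS) return 1;
        if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
            nvmlDeviceGetApplicationsClock(dev, NVML_CLOCK_GRAPHICS, &appGr);
            nvmlDeviceGetApplicationsClock(dev, NVML_CLOCK_MEM,      &appMem);
            nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_GRAPHICS, &maxGr);
            nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_MEM,      &maxMem);
            printf("app clocks: graphics %u MHz, mem %u MHz\n", appGr, appMem);
            printf("max clocks: graphics %u MHz, mem %u MHz\n", maxGr, maxMem);
            /* Raising the clocks would be:
             *   nvmlDeviceSetApplicationsClocks(dev, maxMem, maxGr);
             * which usually needs root and is not modifiable on every GPU. */
        }
        nvmlShutdown();
        return 0;
    }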
This somewhat related question/answer may also be of interest.

Is there any way or even possible to get the overall utilization of a GPU during a period of time?

I am trying to get the information about the overall utilization of a GPU (mine is an NVIDIA Tesla K20, running on Linux) during a period of time.
By "overall" I mean something like, how many streaming multi-processors are scheduled to run, and how many GPU cores are scheduled to run (I suppose if a core is running, it will run at its full speed/frequency?). It would be also nice if I can get the overall utilization measured by flops.
Of course before asking the question here, I've searched and investigated several existing tools/libraries, including NVML (and nvidia-smi built on top of it), CUPTI (and nvprof), PAPI, TAU, and Vampir. However, it seems (but I am not sure yet) none of them could provide me with the needed information. E.g., NVML can report "GPU Utilization" by percent, but according to its document/comment, this utilization is "Percent of time over the past second during which one or more kernels was executing on the GPU", which is apparently not accurate enough. For nvprof, it can report flops for individual kernel (with very high overhead), but I still don't know how well the GPU is utilized.
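(For reference, a minimal sketch of polling that coarse NVML counter, assuming device index 0, is shown below; it only illustrates how blunt that measurement is.)

    /* Sketch: poll the coarse NVML "GPU utilization" counter once per second.
     * It reports the percent of the last sample period in which any kernel was
     * resident -- it says nothing about how many SMs or cores were busy.
     * Link with -lnvidia-ml. */
    #include <stdio.h>
    #include <unistd.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlDevice_t dev;
        nvmlUtilization_t util;

        if (nvmlInit() != NVML_SUCCESS) return 1;
        if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

        for (int i = 0; i < 10; i++) {         /* ten one-second samples */
            if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
                printf("gpu: %u%%  mem: %u%%\n", util.gpu, util.memory);
            sleep(1);
        }
        nvmlShutdown();
        return 0;
    }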
PAPI seems to be able to get an instruction count, but it cannot distinguish floating-point operations from other operations. I haven't tried the other two tools (TAU and Vampir) yet, but I doubt they can meet my needs.
So I am wondering: is it even possible to get the overall utilization information of a GPU? If not, what is the best way to estimate it? The purpose of doing this is to find a better schedule for multiple jobs running on top of a GPU.
I am not sure if I've described my question clearly enough, so please let me know if there is anything I can add for a better description.
Thank you very much!
The NVIDIA Nsight plugin for Visual Studio has very nice graphical features that give the statistics you want, but I have the feeling you are on a Linux machine, so Nsight won't work.
I suggest using the NVIDIA Visual Profiler.
The metrics reference is fairly complete and can be found here. This is how I would gather the data you are interested in:
Active SMX units - look at sm_efficiency. It should be close to 100%. If it's lower, then some of the SMX units are not active.
Active cores / SMX - This depends. The K20 SMX has a quad warp scheduler with dual instruction issue, and each issued warp instruction occupies 32 cores. Each K20 SMX has 192 SP cores and 64 DP units. You need to look at the ipc metric (instructions per cycle). If your program is DP and IPC is 2, then you have 100% utilization (for the entire workload execution): two warps issued instructions per cycle, so all 64 DP units were active during all cycles. If your program is SP, your IPC should theoretically be 6, though in practice this is very hard to achieve. An IPC of 6 means that three of the schedulers each issued instructions to 2 warps, giving work to 3 x 2 x 32 = 192 SP cores.
FLOPS - Well, if your program uses floating point operations, then I would look to flop_count_sp and divide it by the elapsed seconds.
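As a small worked sketch of that last item, with purely hypothetical numbers standing in for what you would read out of nvprof:

    /* Sketch: turn the profiler's flop_count_sp value into achieved GFLOP/s.
     * Both inputs are hypothetical values you would read from nvprof. */
    #include <stdio.h>

    int main(void)
    {
        double flop_count_sp = 3.2e12;   /* example value reported by nvprof */
        double elapsed_s     = 1.75;     /* kernel (or run) wall time in seconds */

        double gflops = flop_count_sp / elapsed_s / 1e9;
        printf("achieved: %.1f SP GFLOP/s\n", gflops);
        /* For context, K20 peak SP is roughly 3.5 TFLOP/s
         * (2496 cores x ~706 MHz x 2 flops per FMA). */
        return 0;
    }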
Regarding frequency, I wouldn't worry, but it does no harm to check with nvidia-smi. If your card has enough cooling, it will stay at its peak frequency while running.
Check the metrics reference as it will provide you much more useful information.
I think nvprof also supports multiple processes. Check here. You can also filter by process ID, so you can collect these metrics "multi-context" or "single-context". In the metrics reference table, there is a column that states whether each metric can be collected in both cases.
Note: the metrics are computed using the HW performance counters and driver-level analysis. If the NVIDIA tools cannot provide more than this, then it's unlikely that other tools will be able to offer more. But I think that properly combining the metrics can tell you everything you want to know about your application's run.

qemu performance same with and without multi-threading and inconsistent behaviour

I am new to the QEMU simulator. I want to emulate our existing pure C H.264 (video decoder) code on an ARM platform (Cortex-A9) using QEMU on Ubuntu 12.04, and I have done that successfully by following links available on the internet.
Our application also has multithreaded (pthreads) code to speed up the process. If we enable multithreading, we get the same performance as with a single thread (without multithreading).
E.g. single thread: 9.75 s
Multithread: 9.76 s
Although QEMU supposedly supports parallel processing, we are not getting the expected performance improvement.
The steps taken are as follows:
1. Compile the code using the arm-linux-gnueabi toolchain.
2. Execute the code: qemu-arm -L executable
3. QEMU version: 1.6.1
Is there any option or setting in QEMU that has to be used if we want to measure multithreaded performance? We want to see the difference between single-threaded and multithreaded runs using QEMU, since we do not have any ARM board with us.
Moreover, the multithreaded application hangs if we run it a third or fourth time, i.e. inconsistent behaviour in QEMU.
Can we rely on this QEMU simulator at all, given that it is not cycle-accurate?
You will not be able to use QEMU to estimate real hardware speed.
Also, QEMU currently runs SMP emulation in a single thread... this means your guest OS will see multiple CPUs but will not receive additional cycles, since all the CPU emulation occurs in a single thread.
Note that I/O is delegated to separate threads... so usually, if your VM is doing CPU and I/O work, you will see at least 1.5+ cores on the host being used.
There has been a lot of research into parallelizing the CPU emulation in QEMU, but without much success. I suggest you buy some real hardware and run it there, especially considering that Cortex-A9 hardware is cheap these days.
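To see this behaviour for yourself, a minimal hypothetical pthread sketch like the one below (fixed integer work split across N threads) can be cross-compiled and timed under qemu-arm with 1 and 4 threads; with serialised CPU emulation the wall time barely changes, while on a real multi-core Cortex-A9 it drops.

    /* Sketch: fixed integer work divided across N threads.  Under qemu-arm's
     * serialised CPU emulation, 1 vs. 4 threads takes about the same wall time;
     * on real multi-core hardware it scales down.
     * Cross-compile: arm-linux-gnueabi-gcc -O2 -pthread threads.c -o threads */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <pthread.h>

    #define TOTAL_ITERS 400000000u

    static void *worker(void *arg)
    {
        uint32_t iters = *(uint32_t *)arg;
        volatile uint32_t acc = 0;
        for (uint32_t i = 0; i < iters; i++)
            acc += i * 2654435761u;          /* pure CPU work, no I/O */
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
        if (nthreads < 1 || nthreads > 16) nthreads = 1;

        pthread_t tid[16];
        uint32_t per_thread = TOTAL_ITERS / (uint32_t)nthreads;

        for (int i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, worker, &per_thread);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);

        printf("done with %d thread(s)\n", nthreads);
        return 0;
    }

Timing it with the host's time command (time qemu-arm ... ./threads 1 versus ./threads 4, using whatever -L path you already use in step 2) should make the serialised CPU emulation obvious.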

Kernel time increases for same number of particles

I am trying to run my code on NVIDIA's K10 GPU. I am using the 5.0 CUDA driver and the 4.2 CUDA runtime. The problem is that the time taken by the kernel increases with iterations, even though each iteration uses the same number of sources and targets (or particles). Because of this, the kernel eventually takes a very long time, and the code crashes with a runtime error that says something like "GPU fallen off the bus".
The plot showing the behavior of increasing kernel run time with number of iterations can be seen here:
https://docs.google.com/open?id=0B5QLL4ig3LVqODdmVjNBTlp5UFU
I tried to run the NVIDIA "nbody" example to understand if the same thing happens here too, and yes it does. For the number of particles/bodies (Np) = 1e5 and 10 iterations, code runs fine. For Np=1e5 and iterations= 100, OR Np=1e6 and iterations = 10, code goes into a mode where it hangs the entire system.
When I run my own kernel as well as NVIDIA's nbody example on a different machine with Tesla C2050 NVIDIA card (CUDA Driver version: 3.2, and runtime version: 3.2), there is no problem, and kernel takes the same amount of time for every iteration.
I am trying to understand what's going on in the machine with the K10 GPU. I have tried different combinations of CUDA driver and runtime versions on this machine, and here is what I get:
For 5.0 CUDA Driver, 4.2 Runtime, it just hangs and sometimes says "GPU fallen off the bus".
For 4.2 CUDA Driver, 4.2 Runtime, the codes (nbody as well as my code) crash with error: "CUDA Runtime API error 39: uncorrectable ECC error encountered."
For 5.0 CUDA Driver, 5.0 Runtime, it just hangs and sometimes says "GPU fallen off the bus".
This is a 64-bit linux machine, which we have recently assembled with NVIDIA K10 GPU card. I am using gfortran44 and gcc44.
Please let me know if any other info. is required to track the problem.
Thanks in advance for the help!
M
I'm mostly just creating an answer so we can call this question closed, but I'll try to add a few details.
Tesla GPUs come in 2 distinct categories: those with a fan, and those without. Those with a fan carry (at this time) the "C" designation, although the K20 product family naming will be slightly different:
These are not exhaustive lists:
Tesla GPUs with a Fan: C870, C1060, C2050, C2070, C2075, K20c ("C Class")
Tesla GPUs without a Fan: M1060, M2050, M2070, M2075, M2090, K10, K20, K20X ("M class")
(note that there is currently no K10 type product with a fan or "C" designation)
Tesla GPUs with a fan are designed to be plugged into a wide variety of PC boxes and chassis, including various workstation and server variants. Since they have their own fan, they require a supply of inlet air that is below a certain temperature level, but given that, they will keep themselves cool. As the workload increases, and the generated heat increases, they will spin up their own fan to keep themselves cool. The main ways you can screw up this process are by either restricting the inlet air flow or by putting it in an ambient air environment that is hotter than its max inlet spec.
Tesla GPUs without a fan have what is called a passive heatsink; they cannot keep themselves cool independently and take a passive role in the cooling process. They still have a temperature sensor, but it becomes the responsibility of the server BMC (baseboard management controller) to monitor this temperature sensor (this is done directly at the hardware/firmware level, independent of any OS or any activity directed at the GPU) and to direct a level of airflow over the card that is sufficient to keep the card cool based on its indicated temperature. The BMC does this by ramping up whatever fans are designed into the server chassis to control airflow over the GPU. Normally there will be shrouding/ducting within the chassis to aid in this process. Server manufacturers integrating these cards have a variety of responsibilities and must follow various technical specifications from NVIDIA in order to make this work.
If you happen to get your hands on a Tesla GPU without a fan and just slap it in some random chassis, you're pretty much guaranteed to have the behavior as described in this question. For this reason, Tesla "M" series and "K" series GPUs are normally only sold to OEMs who have undergone the qualification process.
Since the average sysadmin/system assembler is not likely to devise a suitable closed-loop fan control system, and normally does not have easy access to the specifications defining the temperature sensor and its access method, the only kludgey workaround, if you have one of these cards that you simply must play with, is to direct a high level of continuous airflow over the card in whatever setting you put it. Be advised that this will most likely be noisy. If you don't have a noisy level of airflow, you probably do not have enough airflow to keep the card cool in a high-workload situation.
In addition, you should probably keep an eye on GPU temperatures. Note that the nvidia-smi method for monitoring GPU temperature does not work for all M-class GPUs (i.e. GPUs without a fan). Unfortunately, the method of temperature sensor access in Fermi and prior generations for the M-class GPUs (different from the C-class GPUs) was such that it could not be readily monitored in-system via the nvidia-smi command, so in those cases you will get no temperature reading from nvidia-smi, making this approach even harder to manage. Things changed with the Kepler generation, so now the temperature can be monitored both by the nvidia-smi method and by the server BMC at the hardware/firmware level.
C-class products with a fan have a temperature that can be monitored with nvidia-smi, regardless of generation. But this is normally not necessary, since the card has its own control system to keep itself cool.
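Where the GPU does expose its temperature to nvidia-smi, you can also poll it programmatically through NVML; the sketch below is only illustrative (device index 0, one-second samples assumed).

    /* Sketch: poll the GPU temperature sensor through NVML (where the GPU
     * exposes it) so you can watch it while the card is under load.
     * Link with -lnvidia-ml. */
    #include <stdio.h>
    #include <unistd.h>
    #include <nvml.h>

    int main(void)
    {
        nvmlDevice_t dev;
        unsigned int tempC = 0;

        if (nvmlInit() != NVML_SUCCESS) return 1;
        if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

        for (int i = 0; i < 30; i++) {         /* thirty one-second samples */
            if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC) == NVML_SUCCESS)
                printf("GPU temperature: %u C\n", tempC);
            else
                printf("temperature not readable on this GPU\n");
            sleep(1);
        }
        nvmlShutdown();
        return 0;
    }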
As mentioned in the comments, all GPUs also have a variety of protection mechanisms, none of which are guaranteed to prevent damage. (If you throw the card in a fire, there's nothing to be done about that.) The first typical mechanism is thermal throttling. At some predefined high temperature near the maximum safe operating range of the GPU, the GPU firmware will independently reduce its clocks to attempt to prevent further temperature rise. (If the card is clocked slower, then generally its ability to generate heat is also somewhat reduced.) This is a crude mechanism, and when thermal throttling occurs, something in the cooling arena is already wrong. The card is designed never to enter thermal throttling under normal operating conditions.
If temperatures continue to rise (and there is not much headroom at this point), the card will enter its final protection mode, which is to halt itself. At this point the GPU has become unresponsive to the system, and at the OS level, messages like "GPU has fallen off the bus" are typical. This means cooling has failed and the protection mechanisms have failed.