I am working on threads. I need to find out how many threads QEMU is using. Any ideas?
See this article to find the number of threads used by QEMU:
http://blog.vmsplice.net/2011/03/qemu-internals-overall-architecture-and.html
Related
I am new to the QEMU emulator. I want to emulate our existing pure C H.264 (video decoder) code on an ARM platform (Cortex-A9) using QEMU on Ubuntu 12.04, and I have done so successfully by following links available on the internet.
We also have multithreading (pthreads) code in our application to speed up processing. If we enable multithreading, we get the same performance as with a single thread (i.e. without multithreading).
E.g. single thread: 9.75 sec
Multithread: 9.76 sec
We expected QEMU to support parallel processing, but we are not seeing any performance gain.
The steps done are as follows:
1. Compile the code using the arm-linux-gnueabi toolchain
2. Execute the code
qemu-arm -L executable
3. QEMU version 1.6.1
Is there any option or setting that has to be enabled in QEMU if we want to measure multithreaded performance? We want to see the difference between single-threaded and multithreaded runs using QEMU, since we do not have an ARM board.
Moreover, the multithreaded application hangs if we run it a third or fourth time, i.e. QEMU behaves inconsistently.
Can we rely on the QEMU simulator or not, given that it is not cycle accurate?
You will not be able to use QEMU to estimate real hardware speed.
Also, QEMU currently supports SMP running in a single thread. This means your guest OS will see multiple CPUs but will not receive additional cycles, since all the emulation is occurring in a single thread.
Note that I/O is delegated to separate threads, so usually, if your VM is doing CPU and I/O work, you will see at least 1.5+ cores on the host being used.
There has been a lot of research into parallelizing the CPU emulation in QEMU, but without much success. I suggest you buy some real hardware and run it there, especially considering that Cortex-A9 hardware is cheap these days.
I am new to MPI. I want to use CUDA with MPI. I have three PCs, each with one GPU, which I want to use for some simple processing (matrix multiplication).
But I am not sure what hardware setup is required to use MPI with CUDA.
Please enlighten me.
Update
I am asking this because many places mention clusters with InfiniBand. I do not have such a setup. I only have the ordinary LAN that we have in offices.
And above all, the basic idea is to get a feel for how MPI and CUDA work together and to do a few small test runs, irrespective of performance.
One or more machines with nVidia GPUs that are capable of CUDA.
MPI and CUDA don't have anything to do with each other. You simply use CUDA within each MPI process.
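To make "use CUDA within each MPI process" concrete, here is a minimal sketch, not a definitive setup. It assumes an ordinary MPI installation over a normal LAN (no InfiniBand and no CUDA-aware MPI needed), one GPU per machine, and an element-wise square as a stand-in for the matrix multiplication. The names `square`, `square_on_gpu` and `chunk` are made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <mpi.h>
#include <cuda_runtime.h>

// Stand-in workload (hypothetical): square every element in place.
__global__ void square(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = v[i] * v[i];
}

// Plain CUDA, no MPI involved: copy to the GPU, run the kernel, copy back.
void square_on_gpu(std::vector<float> &host)
{
    int n = static_cast<int>(host.size());
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each process picks a GPU on its own machine.
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    if (device_count == 0) {
        fprintf(stderr, "rank %d: no CUDA device found\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaSetDevice(rank % device_count);

    const int chunk = 1024;               // elements per process (assumed)
    std::vector<float> all;               // only filled on rank 0
    if (rank == 0) {
        all.resize(static_cast<size_t>(chunk) * size);
        for (size_t i = 0; i < all.size(); ++i) all[i] = static_cast<float>(i);
    }
    std::vector<float> local(chunk);

    // Data travels between machines through host memory via MPI.
    MPI_Scatter(all.data(), chunk, MPI_FLOAT,
                local.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    square_on_gpu(local);
    MPI_Gather(local.data(), chunk, MPI_FLOAT,
               all.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("all[3] = %f\n", all[3]);
    MPI_Finalize();
    return 0;
}
```

The exact build line depends on your MPI installation (you need its include and library paths when compiling with nvcc), and you start one process per PC with your MPI's launcher (mpirun or mpiexec) and a list of hosts.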
But, by way of a follow-up to the OP's original question, if I may?
I realize that @gpuguy's question was about hardware, but isn't it true that he must be running one of the operating systems the NVIDIA CUDA compiler supports (i.e. Linux, Windows, OS X)?
There is no OpenSource equivalent of CUDA, is there?
I think my kernel is memory bound (because most GPGPU code is memory bound), but I don't actually know for sure. How can I find out for myself? Probably one has to use the Visual Profiler, as it depends on the GPU used.
If it is explained in the CUDA Programming Guide or in other NVIDIA documentation, don't hesitate to just post a link with a page number, so I can read up on it myself.
Clarification
I would prefer a general "rule" for determining the limiting factor, but in my special case you can find details about my kernel here: Using `overlap`, `kernel time` and `utilization` to optimize one's kernels
This presentation from NVIDIA talks about selectively disabling memory accesses and arithmetic in your kernel by modifying your source code, in order to determine if one of them is limiting your performance.
A nice trick that needs no source code modification can be used for code compiled for compute capability 2.0 and above (based on the answer here):
Using the "--use_fast_math" flag, one can easily increase/decrease compute pressure.
If setting this flag gives a large speed-up, this would indicate a compute-bound kernel.
If setting this flag gives little to no speed-up, this would indicate a balanced/memory-bound kernel.
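To compare the two builds you need a repeatable kernel timing. A minimal sketch using CUDA events follows. The kernel `work` is a hypothetical stand-in that mixes loads with math functions, and `time_kernel_ms` is just an illustrative helper. Keep in mind that the flag only changes anything if the kernel uses operations it affects (e.g. `sinf`, `expf`, division), so treat the result as a hint rather than proof.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one load, some transcendental math, one store.
__global__ void work(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = sinf(x) * expf(-x) + sqrtf(x);
    }
}

// Returns the average time of one kernel launch in milliseconds.
float time_kernel_ms(int n, int repeats)
{
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    work<<<blocks, threads>>>(in, out, n);   // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        work<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until the timed work is done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms / repeats;
}

int main()
{
    printf("average kernel time: %.3f ms\n", time_kernel_ms(1 << 24, 20));
    return 0;
}
```

Build the same file twice with nvcc, once without and once with --use_fast_math, and compare the two printed times.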
I thought I would pitch in an answer even though there is an accepted answer and this question is old.
I had a similar problem in my code, although at the time I didn't know it.
I ran the Nvidia Visual Profiler (nvvp) and analysed my program. I found that the profiler had detected my program was limited in some fashion and had some recommendations.
It is a great tool to use if you are unsure where to begin.
I have a problem that is seemingly solvable by enumerating all possible solutions and then finding the best. In order to do so, I devised a backtracking algorithm that enumerates and stores the best solution if found. It works fine so far.
Now, I wanted to port this algorithm to CUDA. Therefore, I created a procedure that generates some distinct basic cases. These basic cases should be processed in parallel on the GPU. If one of the CUDA-threads finds an optimal solution, all the other threads can - of course - stop their work.
So, I wanted kind of the following: The thread that finds the optimal solution should stop all running CUDA-threads of my program, thus finishing calculation.
After a quick search, I found that threads can only communicate if they are in the same block. (So I suppose it's impossible to stop other blocks' threads.)
The only method I could think of is that I have a dedicated flag optimum_found, which is checked at the beginning of every kernel. If an optimum solution is found, this flag is set to 1, so all future threads know that they do not have to work. But of course, threads already running do not notice this flag if they do not check it at every iteration.
So, is there a possibility to stop all remaining CUDA-threads?
I think that your method of having a dedicated flag could work provided that it was a memory location in global memory. That way you can check this, as you said, at the beginning of each kernel call.
Kernel calls should generally be relatively short anyways, therefore letting the other threads in a batch finish even though an optimal solution was found by one of those threads shouldn't affect your performance too much.
That said, I am fairly sure there is no CUDA call that can kill off other actively executing threads.
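A minimal sketch of the flag idea, checked on every iteration rather than only at kernel start so that threads already running also notice it. This is an illustration under assumptions: the "optimum" test (`is_optimal`, a candidate equal to a known target) is hypothetical, and exactly one candidate is assumed to be optimal. The flag is read through a `volatile` pointer so each check really goes to global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical optimality test: a candidate is optimal if it equals target.
__device__ bool is_optimal(int candidate, int target)
{
    return candidate == target;
}

__global__ void search(volatile int *optimum_found, int *result,
                       int target, int per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int first = tid * per_thread;            // each thread owns its own range

    for (int i = 0; i < per_thread; ++i) {
        if (*optimum_found) return;          // another thread already finished
        int candidate = first + i;
        if (is_optimal(candidate, target)) {
            *result = candidate;             // publish the solution first
            __threadfence();                 // make it visible before the flag
            *optimum_found = 1;              // then tell everyone to stop
            return;
        }
    }
}

// Runs the search and returns the solution, or -1 if none was found.
int run_search(int target)
{
    int *d_flag = nullptr, *d_result = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));
    cudaMemset(d_result, 0xFF, sizeof(int)); // all bits set, i.e. -1

    search<<<64, 128>>>(d_flag, d_result, target, 1000);

    int result = -1;
    cudaMemcpy(&result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);
    cudaFree(d_result);
    return result;
}

int main()
{
    printf("found: %d\n", run_search(123456));
    return 0;
}
```

Threads do not get killed here; they exit on their own at the next check. How often to check is a trade-off between wasted work and the cost of the extra global memory read.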
I think Ian has the right idea here. Optimum performance would come from minimal memory transfers and branching. Writing to global memory and checking flags (branching) goes against the CUDA best practices guide and will reduce your speedup.
You might want to look at callbacks. The main CPU thread can make sure all threads run in the right order. CPU callback threads (read: postprocessing) can take care of the additional overhead work and call the related API functions, as well as dispose of all the sub-thread data... This feature is found in the CUDA samples and compiles for compute capability 2. Hope this helps.
How can I find out the number of registers a CUDA kernel is using at run time?
I know how to find out this information during compilation, but I do not want to hardcode numbers in.
thanks
I don't think it's possible with CUDA 2.x to get the information at run time. Looking at the documentation for the new 3.0 beta, it seems that cudaFuncGetAttributes will do what you want.
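A minimal sketch of such a query with cudaFuncGetAttributes. The kernel `scale` and the helper `registers_per_thread` are hypothetical names used only for illustration; the `numRegs` field of `cudaFuncAttributes` holds the per-thread register count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose register usage we want to query.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the number of registers per thread used by scale, or -1 on error.
int registers_per_thread()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}

int main()
{
    printf("registers per thread: %d\n", registers_per_thread());
    return 0;
}
```

The same struct also reports values such as `sharedSizeBytes` and `maxThreadsPerBlock`, which can be used to pick a launch configuration at run time instead of hardcoding it.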
I think the numbers you see at compile time are the ones that are going to be used at run time, or at least the maximum number of registers used at run time.