How can I detect that an AVX-512 instruction is executed on QEMU?
Of course we can enable the instruction set with the "-cpu" option.
But I want to detect when such an instruction is actually executed.
I would like to implement a runtime that allows a thread on the GPU to obtain the value of its program counter. I have searched the CUDA Programming Guide and the PTX ISA guide as well.
With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity.
May I know where the value of the program counter is stored? In what register assigned to that thread? Is there a way that allows the thread to obtain the value of the program counter?
If a thread cannot do it, can we modify the driver or the compiler, or use inline assembly, to achieve that?
May I know where the value of the program counter is stored? In what register assigned to that thread?
The PTX virtual machine instruction set only exposes a limited number of special registers.
At the time of writing, these do not include the program counter or the stack frame, as far as I am aware. This appears to be true whether the GPU architecture uses a block-wide PC or an independent per-thread PC.
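For illustration, a minimal sketch of what PTX does expose: documented special registers such as %smid and %laneid can be read from CUDA C++ via inline PTX (the helper names here are mine), but there is no analogous register for the program counter.

    __device__ unsigned int read_smid(void)
    {
        unsigned int smid;
        // %smid is one of the documented PTX special registers;
        // nothing comparable exists for the program counter.
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        return smid;
    }

    __device__ unsigned int read_laneid(void)
    {
        unsigned int laneid;
        asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid));
        return laneid;
    }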
Is there a way that allows the thread to obtain the value of the program counter?
Not as far as I am aware.
Are there any CPU-state bits indicating being in an exception/interrupt handler in ARM Cortex-A processors (like, e.g., the IPSR register in ARM Cortex-M CPUs)? In other words, can we tell whether the main thread or an exception handler is currently executing based only on the CPU registers' state?
The CPSR mode field indicates what mode the processor is currently executing in. You cannot act on it directly; you have to move it into a general-purpose register (GPR) to examine it.
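For example, a minimal sketch (GCC-style inline assembly, ARMv7-A assumed) that reads CPSR into a GPR and extracts the mode field in bits [4:0]:

    static inline unsigned int current_mode(void)
    {
        unsigned int cpsr;
        /* MRS copies CPSR into a GPR; the mode field is bits [4:0]. */
        __asm__ volatile("mrs %0, cpsr" : "=r"(cpsr));
        return cpsr & 0x1F;
    }

    /* Typical mode values: 0x10 User, 0x11 FIQ, 0x12 IRQ,
       0x13 Supervisor, 0x17 Abort, 0x1B Undefined, 0x1F System.
       Any value other than 0x10 or 0x1F indicates an exception mode. */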
I would like to know how the CUDA hardware/run-time system handles the following case.
If a warp (warp1 in the following) instruction involves an access to global memory (load/store), the run-time system schedules the next ready warp for execution.
When the new warp is executed:
1. Will the "memory access" of warp1 be conducted in parallel, i.e. while the new warp is running?
2. Will the run-time system put warp1 into a memory-access waiting queue and, once the memory request is completed, move the warp into the runnable queue?
3. Will the instruction pointer related to warp1's execution be incremented automatically and in parallel to the new warp's execution, to mark that the memory request is completed?
For instance, consider this pseudo code: output = input + array[i]; where output and input are both scalar variables mapped into registers, whereas array is stored in global memory.
To run the above statement, we need to load the value of array[i] into a (temporary) register before updating output; i.e. the statement can be translated into two macro assembly instructions: load reg, [&array[i]] and output_register = input_register + reg.
I would like to know how the hardware and the runtime system handle the execution of these two macro assembly instructions, given that the load can't return immediately.
I am not sure I understand your questions correctly, so I'll just try to answer them as I read them:
1. Yes, while a memory transaction is in flight, further independent instructions will continue to be issued. There isn't necessarily a switch to a different warp though - while instructions from other warps will always be independent, the following instructions from the same warp might be independent as well, so the same warp may keep running (i.e. further instructions may be issued from the same warp).
2. No. As explained under 1., the warp can and will continue executing instructions until either the result of the load is needed by a dependent instruction, or a memory fence / barrier instruction requires it to wait for the effect of the store to be visible to other threads.
This can go as far as issuing further (independent) load or store instructions, so that multiple memory transactions can be in flight for the same warp at the same time. So the status of a warp after issuing a load/store doesn't change fundamentally, and it is not halted until necessary.
3. The instruction pointer will always be incremented automatically (there is no situation where you ever do this manually, nor are there instructions allowing you to do so). However, as 2. implies, this doesn't necessarily indicate that the memory access has been performed - there is separate hardware to track the progress of memory accesses.
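To make point 2. concrete, here is a minimal CUDA sketch (a hypothetical kernel, not from the question): the three loads below are mutually independent, so the hardware can have all three memory transactions in flight before the warp first stalls on a dependent use.

    __global__ void sum3(const float *a, const float *b, const float *c,
                         float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = a[i];      // load 1 issued
            float y = b[i];      // load 2 issued without waiting on load 1
            float z = c[i];      // load 3 issued without waiting on 1 or 2
            out[i] = x + y + z;  // first dependent use: the warp may stall here
        }
    }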
Please note that the hardware implementation is completely undocumented by Nvidia. You might find some indications of possible implementations if you search through Nvidia's patent applications.
GPUs up to the Fermi generation (compute capability 2.x) tracked outstanding memory transactions completely in hardware. While undocumented by Nvidia, the common mechanism for tracking (memory) transactions in flight is scoreboarding.
GPUs from newer generations, starting with Kepler (compute capability 3.x), use some assistance in the form of control words embedded in the shader assembly code. While again undocumented, Scott Gray has reverse-engineered these for his Maxas Maxwell assembler. He found that (amongst other things) the control words contain barrier instructions for tracking memory transactions, and was kind enough to document his findings on his Control-Codes wiki page.
What does the ThreadX kernel enter function do?
What does it mean that this function does not return?
How are the threads created in the tx_application_define function scheduled and executed?
The ThreadX kernel-enter routine (tx_kernel_enter) performs the following:
1. If ThreadX initialization needs to take place:
- Call any port-specific pre-processing.
- Invoke the low-level initialization to handle all processor-specific initialization issues.
- Invoke the high-level initialization to initialize all of the ThreadX components.
- Call any port-specific post-processing.
2. Call the user-provided initialization function tx_application_define.
3. Call any port-specific pre-scheduler processing.
4. Enter the scheduling loop to start executing threads.
To answer your questions:
In step #2, the ThreadX kernel-enter routine calls function tx_application_define, which is up to you to implement. It is pretty similar in essence to a user-callback routine, except that it is not provided as a function pointer (i.e., the tx_application_define symbol is resolved at link time instead of at runtime). This function is where you should typically create all the threads.
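For illustration, a minimal sketch of how this fits together (the thread name, stack size, and priority below are arbitrary choices, not values mandated by ThreadX):

    #include "tx_api.h"

    #define WORKER_STACK_SIZE 1024

    static TX_THREAD worker;
    static UCHAR worker_stack[WORKER_STACK_SIZE];

    static VOID worker_entry(ULONG input)
    {
        while (1) {
            /* thread body */
            tx_thread_sleep(100);  /* sleep for 100 timer ticks */
        }
    }

    /* Called exactly once by the kernel-enter routine (step #2),
       before the scheduling loop starts. */
    VOID tx_application_define(VOID *first_unused_memory)
    {
        tx_thread_create(&worker, "worker", worker_entry, 0,
                         worker_stack, WORKER_STACK_SIZE,
                         16, 16,  /* priority, preemption threshold */
                         TX_NO_TIME_SLICE, TX_AUTO_START);
    }

    int main(void)
    {
        tx_kernel_enter();  /* never returns; step #4 loops forever */
        return 0;
    }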
In step #4, the ThreadX kernel-enter routine starts an infinite loop, which is in essence the scheduler itself. This is where all the context switches are managed and the threads go in and out of execution. Upon every HW interrupt, the PC (program counter) jumps from the currently executing thread to the IV (interrupt vector), and from there to the connected ISR (interrupt service routine). After that, it jumps back to the scheduler (i.e., into the infinite loop), which determines whether a context switch is required or not. Execution eventually returns to the last executing thread or to some other thread, depending on the scheduler's decision.
As you can understand, every context switch is the result of a HW interrupt, but not every HW interrupt results in a context switch. You should typically refrain from enabling interrupts yourself (e.g., by calling function __enable_interrupt from within function tx_application_define), as the ThreadX kernel-enter routine takes care of that just before it enters the scheduling loop.
When executing this instruction, I got an exception:
LFS ESI,PWORD PTR [EBP+12]
From this page http://wiki.osdev.org/Double_Fault#Double_Fault
Any PUSH or POP instruction or any instruction using ESP or EBP as a base register is executed, while the stack address is not in canonical form.
So I think it should be a stack-segment fault (#SS) here.
But the system gives a general protection exception (0D).
Could anyone tell me why this is the result?
A general protection fault for an LFS occurs when:
- the segment selector index you are trying to load is not within the descriptor table limits,
- the segment is in the descriptor table, but it is not a readable data segment, or
- your privilege level is higher (meaning less privileged) than the privilege level of the descriptor.
So, the problem is not the instruction itself, but the segment descriptor table.
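As an illustration of the descriptor-table side, here is a minimal sketch (a hypothetical helper, written in C for clarity) that encodes the kind of GDT entry LFS can load without faulting: a present, readable/writable data segment. If the selector instead points past the table limit, or at an entry whose access byte does not describe a readable data segment, you get exactly the #GP described above.

    #include <stdint.h>

    /* Encode a flat 32-bit read/write data segment descriptor. */
    static uint64_t make_data_descriptor(uint32_t base, uint32_t limit)
    {
        uint64_t d = 0;
        d |= limit & 0xFFFFULL;                      /* limit bits 15:0    */
        d |= (uint64_t)(base & 0xFFFFFFu) << 16;     /* base bits 23:0     */
        d |= 0x92ULL << 40;                          /* P=1, DPL=0, S=1,
                                                        type = data, R/W   */
        d |= ((uint64_t)(limit >> 16) & 0xF) << 48;  /* limit bits 19:16   */
        d |= 0xCULL << 52;                           /* G=1 (4 KiB), D/B=1 */
        d |= (uint64_t)(base >> 24) << 56;           /* base bits 31:24    */
        return d;
    }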
See chapter 3 in the Intel Software Developer’s Manual Volume 3A:
http://www.intel.com/products/processor/manuals/?wapkw=(Intel+64+and+IA-32+Architectures)