Uniquely identify an instruction in LLVM

Is there a way to uniquely identify a given instruction in LLVM?
For example, if we have a function that contains an "add" instruction, is there a way to say that it is the same "add" instruction every time we execute it?
Thank you

What do you mean by "execute" it?
If you execute it in the interpreter (via lli), then yes, each LLVM instruction is a unique Value*.
If you compile it down to machine code and execute it, you can use debug info to correlate the LLVM instruction with the machine code generated from it. It's trickier, but doable. There are complications, such as multiple machine instructions being generated from a single LLVM instruction.
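For the interpreter case, here is a minimal sketch of using an instruction's address as its identity (the helper name findFirstAdd is mine; it assumes you already have a handle on the llvm::Function, e.g. from a pass):

#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"

// Within a single LLVMContext, an Instruction is a Value, and its address
// is stable for the lifetime of the IR: the same `add` is the same
// Instruction* no matter how many times you revisit (or lli executes) it.
const llvm::Instruction *findFirstAdd(const llvm::Function &F) {
  for (const llvm::BasicBlock &BB : F)
    for (const llvm::Instruction &I : BB)
      if (I.getOpcode() == llvm::Instruction::Add)
        return &I; // this pointer uniquely identifies the instruction
  return nullptr;
}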

Related

How to access executing instruction binary code (e.g., opcode) from C helper functions in QEMU

I would like to modify the QEMU emulator's behaviour when it executes certain assembly instructions of a target architecture (e.g., RISC-V) running on top of a host (e.g., x86).
My question is: is it possible to access information about the instruction being executed from the C helper?
There is some information that can be accessed through the context variable pointer, but I wasn't able to access the instruction's binary encoding, for example. Any ideas?
My question is: is it possible to access information about the
instruction being executed from the C helper?
Yes. One way to do it is to pass this information to the helper from the translator function. Another is to use cpu_ld*_code (e.g., cpu_ldub_code) in the helper to fetch the instruction from memory.
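A minimal sketch of the second approach (the helper name and its registration are my own illustration; it assumes the translator emits a call passing the guest PC of the current instruction, and would live in the target's op_helper.c):

/* Registered in the target's helper.h, e.g.:
 *   DEF_HELPER_2(inspect_insn, void, env, tl)
 * The translator passes the guest PC; the helper re-fetches the
 * instruction's first byte from guest code memory at execution time. */
void helper_inspect_insn(CPUArchState *env, target_ulong pc)
{
    uint8_t byte0 = cpu_ldub_code(env, pc);  /* fetch from guest code */
    qemu_log("guest insn at " TARGET_FMT_lx " starts with 0x%02x\n",
             pc, byte0);
}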

When is a command's compile function called?

My understanding of Tcl execution is that if a command's compile function is defined, it is called first, before the command's execution function, when the command comes to be executed.
Take the append command as an example; here is its definition in tclBasic.c:
static CONST CmdInfo builtInCmds[] = {
    {"append", (Tcl_CmdProc *) NULL, Tcl_AppendObjCmd,
        TclCompileAppendCmd, 1},
Here is my testing script:
$ cat t.tcl
set l [list 1 2 3]
append l 4
I add gdb breakpoints at both functions, TclCompileAppendCmd and Tcl_AppendObjCmd. My expectation is that TclCompileAppendCmd is hit before Tcl_AppendObjCmd.
gdb's target is tclsh8.4 and the argument is t.tcl.
What I see is interesting:
TclCompileAppendCmd does get hit first, but not from t.tcl; rather, it is hit from init.tcl.
TclCompileAppendCmd gets hit several times, and all of them are from init.tcl.
When t.tcl itself executes, it is Tcl_AppendObjCmd that gets hit, not TclCompileAppendCmd.
I cannot make sense of it:
Why is the compile function called for init.tcl but not for t.tcl?
Each script should be compiled independently, i.e. the object holding the compiled append command from init.tcl is not reused for later scripts, is it?
[UPDATE]
Thanks Brad for the tip; after I moved the script into a proc, I can see TclCompileAppendCmd being hit.
The compilation function (TclCompileAppendCmd in your example) is called by the bytecode compiler when it wants to issue bytecode for a particular instance of that particular command. The bytecode compiler also has a fallback if there is no compilation function for a command: it issues instructions to invoke the standard implementation (which would be Tcl_AppendObjCmd in this case; the NULL in the other field causes Tcl to generate a thunk in case someone really insists on using a particular API but you can ignore that). That's a useful behaviour, because it is how operations like I/O are handled; the overhead of calling a standard command implementation is pretty small by comparison with the overhead of doing disk or network I/O.
But when does the bytecode compiler run?
On one level, it runs whenever the rest of Tcl asks for it to be run. Simple! But that's not really helpful to you. More to the point, it runs whenever Tcl evaluates a script value in a Tcl_Obj that doesn't already have the bytecode type (or whose saved bytecode is for a different resolution context or a different compilation epoch), except when the evaluation has asked not to be bytecode-compiled via the TCL_EVAL_DIRECT flag to Tcl_EvalObjEx (or Tcl_EvalEx, which is a convenient wrapper for it). It's that flag which is causing you problems.
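A minimal sketch at the C API level (assuming an initialized Tcl_Interp *interp) showing the two evaluation paths side by side:

#include <tcl.h>

/* The same script object evaluated both ways. Only the first call goes
 * through the bytecode compiler (and so can reach TclCompileAppendCmd);
 * TCL_EVAL_DIRECT forces the interpretation path instead. */
void evalBothWays(Tcl_Interp *interp)
{
    Tcl_Obj *script = Tcl_NewStringObj("set l [list 1 2 3]; append l 4", -1);
    Tcl_IncrRefCount(script);
    Tcl_EvalObjEx(interp, script, 0);               /* compiled to bytecode */
    Tcl_EvalObjEx(interp, script, TCL_EVAL_DIRECT); /* interpreted directly */
    Tcl_DecrRefCount(script);
}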
When is that flag used?
It's actually pretty simple: it's used when some code is believed to be going to run only once, because then the cost of compilation is larger than the cost of taking the interpretation path. It's particularly used by Tk's bind command for running substituted script callbacks, but it is also used by source and by the main code of tclsh (essentially anything using Tcl_FSEvalFileEx or its predecessors/wrappers Tcl_FSEvalFile and Tcl_EvalFile). I'm not 100% sure whether that's the right choice for a sourced context, but it is what happens now. However, there is a workaround that is (highly!) worthwhile if you're handling looping: within the sourced file, put the code in a compiled context, either with a procedure that you call immediately or with an apply (I recommend the latter these days); a sketch follows. init.tcl uses these tricks, which is why you were seeing it compile things.
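A minimal sketch of that workaround, applied to the original t.tcl:

# Option 1: a procedure called immediately; the body is bytecode-compiled,
# so TclCompileAppendCmd now runs for the append inside it.
proc main {} {
    set l [list 1 2 3]
    append l 4
}
main

# Option 2: the same thing with an anonymous function.
apply {{} {
    set l [list 1 2 3]
    append l 4
}}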
And no, we don't normally save compiled scripts between runs of Tcl. Our current compiler is fast enough that that's not really worthwhile; the cost of verifying that loaded compiled code is correct for the current interpreter is high enough that it's actually faster to recompile from the source code. (I'm working on a slower compiler that generates enormously better code.) There is a commercial tool suite from ActiveState (the Tcl Dev Kit) which includes an ahead-of-time compiler, but that is focused on shrouding code for commercial deployment, not on speed.

NVRTC and __device__ functions

I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I have identified a specific __device__ function whose performance can be strongly improved by removing all global memory accesses.
Does CUDA allow the dynamic compilation and linking of a single __device__ function (not a __global__), in order to "override" an existing function?
I am pretty sure the really short answer is no.
Although CUDA has dynamic/JIT device linker support, it is important to remember that the linkage process itself is still static.
So you can't delay load a particular function in an existing compiled GPU payload at runtime as you can in a conventional dynamic link loading environment. And the linker still requires that a single instance of all code objects and symbols be present at link time, whether that is a priori or at runtime. So you would be free to JIT link together precompiled objects with different versions of the same code, as long as a single instance of everything is present when the session is finalised and the code is loaded into the context. But that is as far as you can go.
It looks like you have a "main" kernel with a part that is "switchable" at run time.
You can definitely do this using NVRTC. You'd need to do something like the following:
Instead of compiling the main kernel ahead of time, store it as a string to be compiled and linked at run time.
Let's say the main kernel calls "myFunc", a __device__ function that is chosen at run time.
You can generate the appropriate "myFunc" source from your equations at run time.
Now you can create an NVRTC program from these sources using nvrtcCreateProgram.
That's about it. The key is to delay compiling the main kernel until you need it at run time; a minimal sketch follows. You may also want to cache your kernels somehow so you end up compiling only once.
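A sketch of that flow (the names and the example equation are mine; error checking omitted), concatenating the run-time-generated myFunc with the stored main-kernel source into one NVRTC program:

#include <nvrtc.h>
#include <string>

// Returns PTX for the main kernel combined with a freshly generated myFunc.
std::string compileMainKernel()
{
    std::string myFuncSrc =       // generated from your equations at run time
        "__device__ float myFunc(float x) { return 2.0f * x + 1.0f; }\n";
    std::string mainSrc =         // stored as a string instead of precompiled
        "extern \"C\" __global__ void mainKernel(float *d, int n) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) d[i] = myFunc(d[i]);\n"
        "}\n";
    std::string src = myFuncSrc + mainSrc;

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "main.cu", 0, nullptr, nullptr);
    const char *opts[] = {"--gpu-architecture=compute_60"};
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    return ptx;   // load with cuModuleLoadData, then cuModuleGetFunction
}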
There is one problem I foresee: NVRTC may not find the curand device calls, which may cause some issues. One workaround would be to look at the header the device function call is in and use nvcc to compile the appropriate device function to PTX. You can store the resulting PTX as text and use cuLinkAddData to link it with your module. You can find more information in this section.
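A sketch of that linking step via the driver API (assumes ptx holds the NVRTC output from above and curandPtx holds the nvcc-compiled PTX stored as text; error checking omitted):

#include <cuda.h>
#include <string>

CUmodule linkModules(const std::string &ptx, const std::string &curandPtx)
{
    CUlinkState link;
    void *cubin; size_t cubinSize;
    cuLinkCreate(0, nullptr, nullptr, &link);
    cuLinkAddData(link, CU_JIT_INPUT_PTX, (void *)ptx.c_str(),
                  ptx.size() + 1, "main.ptx", 0, nullptr, nullptr);
    cuLinkAddData(link, CU_JIT_INPUT_PTX, (void *)curandPtx.c_str(),
                  curandPtx.size() + 1, "curand_calls.ptx", 0, nullptr, nullptr);
    cuLinkComplete(link, &cubin, &cubinSize);  // cubin is owned by `link`

    CUmodule module;
    cuModuleLoadData(&module, cubin);  // load before destroying the linker
    cuLinkDestroy(link);
    return module;
}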

CUDA trace emulation - need some expert insight

I am working on a GPU trace emulation tool on Windows as part of my research work in grad school. I am working on CUDA runtime trace emulation, to be specific.
I use simple DLL injection with MS Detours to intercept the CUDA runtime APIs. I store the API calls and their parameters in a trace file. I run into some problems while trying to emulate the APIs from my trace file (I use the word "playback" to denote this action).
A typical trace file begins by making calls to __cudaRegisterFatBinary and __cudaRegisterFunction. This is followed by a call to cudaMalloc.
What I did:
1) I came across the famous GPUOcelot and found the cubin structure that NVIDIA is currently using. I use that to save the address parameter of __cudaRegisterFatBinary in intercept mode, and during playback I reuse the pointer for __cudaRegisterFatBinary by repopulating the structure in memory.
2) In __cudaRegisterFunction I am not sure what the parameters hostFunction, deviceFunction, and deviceName refer to. I mean, I don't understand how I could populate them while playing back from my trace file. I am just saving the pointers from the original execution and using them to imitate the call. But there is no way of knowing whether the call goes through fine, since it does not have a return value.
3) The cudaMalloc following these two entry-point functions returns CUDA error code 11, which is cudaErrorInvalidValue according to the NVIDIA documentation. I have no idea why this should be the case. I am assuming that something is wrong with the previous two function calls. I also have a feeling that something is wrong with the implicit primary-context creation by the CUDA runtime. Can someone give me some insight into CUDA runtime execution and point me to what I might be missing?
I know it's a ton of information without any useful code. I don't know which part of the code to post here; I will do so when people start taking an interest in my question and ask me specific things about my project. Initially I am just hoping that I am missing something big and high-level that one of you can spot.
I greatly appreciate your time and interest!
Sounds very interesting overall. Your "Error: CUDA invalid value" could be related to the params of __cudaRegisterFunction. The param deviceName sounds like it identifies which GPU (card?) to use. Check the CUDA SDK; there are many demos that enumerate the GPUs on the system, and perhaps those values are valid for deviceName. As for hostFunction and deviceFunction, these sound like either function IDs or perhaps function pointers. Also, you can call cudaGetLastError() to test whether a function call was successful (it returns cudaSuccess if everything is OK; take a look at the error-logging macros in the SDK). Good luck!
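A minimal sketch of the error-checking idiom suggested there (in the spirit of the SDK's logging macros; the macro name is mine), which could be dropped into the playback path:

#include <cstdio>
#include <cuda_runtime.h>

// Wraps a runtime API call and reports any failure with its string form.
#define CHECK_CUDA(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess)                                   \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err));                     \
    } while (0)

// usage during playback, e.g.:
//   CHECK_CUDA(cudaMalloc(&devPtr, size));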

adding functions in a CUDA program

So, I think I have a very weird question.
So, let's say that I already have a program loaded on my GPU, and in that program I call a function X. But that function X is not defined yet.
I want to be able to modify that function X dynamically, completely changing its code and putting it into the program without recompiling the rest or losing any pointers whatsoever.
To compare it with something most of us know: I want to be able to do what shaders allow in OpenGL. In the middle of execution, I can change the code of one shader, recompile only that shader, activate the program, and from then on use the new version.
So, is it possible? Or do I need to recompile the whole thing every time? And if I have to recompile, do I lose the various arrays that I created in global memory?
Thanks
W
If you compile with the -cuda flag using nvcc, you can get the intermediate C++ source that streams PTX to the processor. In theory, you could post-process this intermediate output to generate PTX dynamically on the fly and send it over. You might even be able to make the PTX self-modifying, but that's way out of my league.
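A minimal sketch of the loading side under that approach (assumes a PTX string produced at run time, e.g. by post-processing as above; the entry name "X" echoes the question and would need to be unmangled, e.g. extern "C"). Note that loading a new module into the existing context does not free allocations made with cuMemAlloc/cudaMalloc, so arrays already created in global memory survive the swap:

#include <cuda.h>
#include <string>

// Replace the current version of X with a freshly generated one.
void reloadX(const std::string &ptx, CUmodule *mod, CUfunction *fn)
{
    if (*mod) cuModuleUnload(*mod);       // drop the old module, if any
    cuModuleLoadData(mod, ptx.c_str());   // JIT-load the new PTX text
    cuModuleGetFunction(fn, *mod, "X");   // fetch the new entry point
}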