How are functions modified at run-time then propagated to multiple threads? - function

With Clojure (and other Lisp dialects) you can modify running code. So, when a function is modified during runtime is that change made available to multiple threads?
I'm trying to figure out how it works technically in a concurrent setting: if several threads are using a function foo, what happens when I redefine (say using defn) the function foo?
There has to be some synchronization going on: when and how does such synchronization happen and what does it cost?
Say on a JVM, is the function referenced using a volatile reference? If so, does it mean every single time there's a "function lookup" then one has to pay the volatile cost?

In Clojure functions are instances of the IFn class and they are almost always stored in vars. vars are Clojures mechanism for thread local values.
when you define a function that sets the "root binding" of the var to reference the function
threads other threads get whatever the the current value of the root binding for the var but can't change the value. this prevents any two threads from having to fight over the value of the var because only the root thread can set the value.
threads can choose to use a new value of the var if they need to, but calling binding which gives then their own thread local value that they are free to change at will because no other thread can read it.
A good understanding of vars is well worth a little study, they are a very useful concurrency device once you get used to them.
ps: the root thread is usually the REPL
pss: you are of course free to store your functions in something other than vars, if for instance you needed to atomically update a group of functions, though this is rare.

Related

How to detect function pointers from assignment statements in LLVM IR?

I want to detect all occurences of function pointers, calls as well as assignments. I am able to detect indirect calls but how to do the assignment ones? Is there a char Iterator in the Instruction.h header or something which iterates character by character for each instruction in the .ll file?
Any help or links will be appreciated.
Start with unoptimized LLVM IR, then scan the use-list for all functions.
In the comments we established that this is for a school project, not a production system. The difference when building a production system is that you should rely only on the properties that are guaranteed, while for research or prototyping or schoolwork, you can build something that happens to work by relying on properties that are not guaranteed. That's what we're going to do here.
It so happens (but is not guaranteed!) that as clang converts C to LLVM IR, it will[1] emit a store for every function pointer used. As long as we don't run the LLVM optimizer, they will still be there.
The "forwards" direction would be to look into instructions and see whether any of them do the action you want (assign or call a function pointer). I've got a better idea: let's do it backwards. Every llvm::Value has a list of all places where that value is used, called the use-list. For example %X = add i32 %Y, %Z, %X would appear in the use-list of %Y because %X is using %Y.
Starting with your Module, iterate over every function (ie. for (auto &F : M->functions()) {), then scan the use-list for the function (ie. for (const auto *U : F.users())) and look at those Values (ie. U.get()). If the user value is an Instruction, you can query which Function this Instruction belongs to -- the Instruction has a parent BasicBlock and the BasicBlock has a parent Function. If it's a ConstantExpr, you'll need to recurse through that ConstantExpr's use-list until you find all the instructions using those constants. If the instruction is a direct call, you want to skip it (unless it's a direct call with an argument that is also a function pointer, like somefun(1, &somefun, 2);).
This will miss any code that uses C typed function pointers but never point to any functions, for instance C code like void (*fun_ptr)(int) = NULL;. If this is important to you then I suggest writing a Clang AST tool with the RecursiveASTVisitor instead of using LLVM.
[1] clang is free not to, for instance given if (0) { /* ... */ } clang could skip emitting LLVM IR for the disabled code because it's faster to do that. If your function pointer use was in there, LLVM would never get the opportunity to see it.

In CUDA, how can I get this warp's thread mask in conditionally executed code (in order to execute e.g., __shfl_sync or <cg>.shfl?

I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating usage of warp shuffles (e.g., __shfl).
Basically the relevant part of the kernel might be something like this:
int f = d[threadIdx.x];
int warpLeader = <something in [0,32)>;
// Point being, some threads in the warp get removed by i < stop
for(int i = k; i < stop; i+=skip)
{
// Point being, potentially more threads don't see the shuffle below.
if(mem[threadIdx.x + i/2] == foo)
{
// Pre CUDA 9.0.
f = __shfl(f, warpLeader);
}
}
Maybe that's not the best example (real code too complex to post), but the two things accomplished easily with the old intrinsics were:
Shuffle/broadcast to whatever threads happen to be here at this time.
Still get to use the warp-relative thread index.
I'm not sure how to do the above post CUDA 9.0.
This question is almost/partially answered here: How can I synchronize threads within warp in conditional while statement in CUDA?, but I think that post has a few unresolved questions.
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question says to use coalesced_group, but my understanding is that this type of cooperative_group re-ranks the threads, so if you had a warpLeader (on [0, 32)) in mind as above, I'm not sure there's a way to "figure out" its new rank in the coalesced_group.
(Also, based on the truncated comment conversation in the linked question, it seems unclear if coalesced_group is just a nice wrapper for __activemask() or not anyway ...)
It is possible to iteratively build up a mask using __ballot_sync as described in the linked question, but for code similar to the above, that can become pretty tedious. Is this our only way forward for CUDA > 9.0?
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question doesn't show any such usage. Furthermore, the canonical blog specifically says that usage is the one that satisfies this:
Shuffle/broadcast to whatever threads happen to be here at this time.
The blog states that this is incorrect usage:
//
// Incorrect use of __activemask()
//
if (threadIdx.x < NUM_ELEMENTS) {
unsigned mask = __activemask();
val = input[threadIdx.x];
for (int offset = 16; offset > 0; offset /= 2)
val += __shfl_down_sync(mask, val, offset);
(which is conceptually similar to the usage given in your linked question.)
But for "opportunistic" usage, as defined in that blog, it actually gives an example of usage in listing 9 that is similar to the one that you state "won't work". It certainly does work following exactly the definition you gave:
Shuffle/broadcast to whatever threads happen to be here at this time.
If your algorithm intent is exactly that, it should work fine. However, for many cases, that isn't really a correct description of the algorithm intent. In those cases, the blog recommends a stepwise process to arrive at a correct mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
If your program does opportunistic warp-synchronous programming, use “detective” functions such as __activemask() and __match_all_sync() to find the right mask.
Use __syncwarp() to separate operations with intra-warp dependences. Do not assume lock-step execution.
Note that steps 1 and 2 are not contradictory to other comments. If you know for certain that you intend the entire warp to participate (not typically known in a "opportunistic" setting) then it is perfectly fine to use a hardcoded full mask.
If you really do intend the opportunistic definition you gave, there is nothing wrong with the usage of __activemask() to supply the mask, and in fact the blog gives a usage example of that, and step 4 also confirms that usage, for that case.

Cuda _sync functions, how to handle unknown thread mask?

This question is about adapting to the change in semantics from lock step to independent program counters. Essentially, what can I change calls like int __all(int predicate); into for volta.
For example, int __all_sync(unsigned mask, int predicate);
with semantics:
Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
The docs assume that the caller knows which threads are active and can therefore populate mask accurately.
a mask must be passed that specifies the threads participating in the call
I don't know which threads are active. This is in a function that is inlined into various places in user code. That makes one of the following attractive:
__all_sync(UINT32_MAX, predicate);
__all_sync(__activemask(), predicate);
The first is analogous to a case declared illegal at https://forums.developer.nvidia.com/t/what-does-mask-mean-in-warp-shuffle-functions-shfl-sync/67697, quoting from there:
For example, this is illegal (will result in undefined behavior for warp 0):
if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);
The second choice, this time quoting from __activemask() vs __ballot_sync()
The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.
The operating semantics appear to be:
There is a warp of N threads
M (M <= N) threads are enabled by compile time control flow
D (D subset of M) threads are converged, as a runtime property
__activemask returns which threads happen to be converged
That suggests synchronising threads then using activemask,
__syncwarp();
__all_sync(__activemask(), predicate);
An nvidia blog post says that is also undefined, https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
Calling the new __syncwarp() primitive at line 10 before __ballot(), as illustrated in Listing 11, does not fix the problem either. This is again implicit warp-synchronous programming. It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.
That marks the end of my ideas. That same blog concludes with some guidance on choosing a value for mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
However, I can't compute what the mask should be. It depends on the control flow at the call site that the code containing __all_sync was inlined into, which I don't know. I don't want to change every function to take an unsigned mask parameter.
How do I retrieve semantically correct behaviour without that global transform?
TL;DR: In summary, the correct programming approach will most likely be to do the thing you stated you don't want to do.
Longer:
This blog specifically suggests an opportunistic method for handling an unknown thread mask: precede the desired operation with __activemask() and use that for the desired operation. To wit (excerpting verbatim from the blog):
int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
That should be perfectly legal.
You might ask "what about item 2 mentioned at the end of the blog?" I think if you read that carefully and taking into account the previous usage I just excerpted, it's suggesting "don't just use __activemask()" if you intend something different. That reading seems evident from the full text there. That doesn't abrogate the legality of the previous construct.
You might ask "what about incidental or enforced divergence along the way?" (i.e. during the processing of my function which is called from elsewhwere)
I think you have only 2 options:
grab the value of __activemask() at entry to the function. Use it later when you call the sync operation you desire. That is your best guess as to the intent of the calling environment. CUDA doesn't guarantee that this will be correct, however this should certainly be legal if you don't have enforced divergence at the point of your sync function call.
Make the intent of the calling environment clear - add a mask parameter to your function and rewrite the code everywhere (which you've stated you don't want to do).
There is no way to deduce the intent of the calling environment from within your function, if you permit the possibility of warp divergence prior to entry to your function, which obscures the calling environment intent. To be clear, CUDA with the Volta execution model permits the possibility of warp divergence at any time. Therefore, the correct approach is to rewrite the code to make the intent at the call site explicit, rather than trying to deduce it from within the called function.

Most efficient way to track x and y values of multiple object instances on the stage?

I have an arbitrary number of object instances on the stage. At any one given time the number of objects may be between 10 and 50. Each object instance can move, but the movement is gradual, the current coordinates are not predictable and at any given moment I may need to retrieve the coordinates of a specific object instance.
Is there a common best-practice method to use in this case to track object instance coordinates? I can think of two approaches:
I write a function within the object class that, upon arbitrary event execution, is called on an object instance and returns that object instances coordinates.
Within the object class I declare global static variables that represent x and y values and, upon arbitrary event execution, the variables are updated with the latest values for that object instance.
While I can get both methods to work, I do not know whether one or the other would be detrimental to program performance in the long run. I lean toward the global variables because I expect it is less resource intensive to update and call a variable than to call a function which subsequently updates and calls a variable. Maybe there is even a third option?
I understand that this is a somewhat subjective question. I am asking with respect to resource consumption so please answer in that respect.
I don't understand.. The x and y properties are both stored on the object (if it's a DisplayObject) and readable.. Why do you need to store these in a global or whatever?
If you're not using DisplayObject as a base, then just create the properties yourself with appropriate getters.
If you want to get the coordinates of all your objects, add them to an array, let's say objectList.
Then just use a loop to check the values:
for each(var i:MovieClip in objectList)
{
trace(i.x, i.y);
}
I think I'm misunderstanding the question, though.
definitely 1.
for code readability use a get property, ie
public function get x():Number { return my_x; }
The problem with 2, is you may well also need to keep track of which object those coords are for - not to mention it is just messy... Globals can get un-managable quickly, hence all this reesearch into OOP and encapsuilation, and doing away with (mostly) the need for globals..
with only 50 or less object - don't even consider performance issues...
And remember that old mantra - "Premature optimisation is the root of programming evil" ;-)

What is (or should be) the cyclomatic complexity of a virtual function call?

Cyclomatic Complexity provides a rough metric for how hard to understand a given function is, or how much potential for containing bugs it has. In the implementations I've read about, usually all of the basic control flow constructs (if, case, while, for, etc.) increase the complexity of a function by 1. It appears to me given that cyclomatic complexity is intended to determine "the number of linearly independent paths through a program's source code" that virtual function calls should increase the cyclomatic complexity of a function as well, because of the ambiguity of which implementation will be called at runtime (the call creates another branch in the path of execution).
However, penalizing the function the same amount that one would if it contained an equivalent switch statement (one point for every 'case' keyword, with one case keyword for every class in the hierarchy implementing the virtual function in question) feels overly harsh, because a virtual function call is generally regarded as much better programming practice.
What should the cost in cyclomatic complexity of a virtual function call be? I'm not sure if my reasoning is an argument against the utility of cyclomatic complexity as a metric or one against the use of virtual functions or something different.
Edit: After people's responses I realized that it shouldn't add to cyclomatic complexity because we could consider the virtual function call equivalent to a call to a global function that contains the massive switch statement. Even though that function will get a bad score, it only exists once in the program, whereas replacing each virtual function call directly with switch statement would cause the cost many times.
Cyclomatic complexity usually does not apply across function call boundaries, but is an intra-function metric. Hence, virtual calls do not count any more than non-virtual, static function calls.
A virtual function call does not increase the cyclomatic complexity, because the "ambiguity [over] of which implementation will be called" is external to the function call. Once the objects value is set, there is no ambiguity. We know exactly what methods will be called.
BaseClass baseObj = null;
// this part has multiple paths & add to CC
if (x == y)
baseObj = new Derived1();
else
baseObj = new Derived2();
// this part has one path and does not add to the CC
baseObj.virtualMethod1();
baseObj.virtualMethod2();
baseObj.virtualMethod3();
virtual function calls should increase
the cyclomatic complexity of a
function as well, because of the
ambiguity of which implementation will
be called at runtime
Ah, but it isn't ambiguous at runtime (unless you're doing metaprogramming / monkey patching); it's completely determined by the type/class of the receiver.
I'm not a big fan of cyclomatic complexity, but in this case you're calling a function. It will do approximately the same thing (unless the class hierarchy design is really screwed up), with some variations depending on that calls it. Thing is, if you call any function, you can get some varied behavior depending on the arguments you pass in, and this isn't counted in CC.
Therefore, I'd completely ignore that cost.