Tcl_DoOneEvent is blocked if tkwait / vwait is called - tcl

There is an external C++ function that is called from Tcl/Tk and does some stuff in a noticeable amount of time. Tcl caller has to get the result of that function so it waits until it's finished. To avoid blocking of GUI, that C++ function has some kind of event loop implemented in its body:
while (m_curSyncProc.isRunning()) {
const clock_t tm = clock();
while (Tcl_DoOneEvent(TCL_ALL_EVENTS | TCL_DONT_WAIT) > 0) {} // <- stuck here in case of tkwait/vwait
// Pause for 10 ms to avoid 100% CPU usage
if (double(clock() - tm) / CLOCKS_PER_SEC < 0.005) {
nanosleep(10000);
}
}
Everything works great unless tkwait/vwait is in action in Tcl code.
For example, for dialogs the tkwait variable someVariable is used to wait Ok/Close/<whatever> button is pressed. I see that even standard Tk bgerror uses the same method (it uses vwait).
The problem is that once called Tcl_DoOneEvent does not return while Tcl code is waiting in tkwait/vwait line, otherwise it works well. Is it possible to fix it in that event loop without total redesigning of C++ code? Because that code is rather old and complicated and its author is not accessible anymore.

Beware! This is a complex topic!
The Tcl_DoOneEvent() call is essentially what vwait, tkwait and update are thin wrappers around (passing different flags and setting up different callbacks). Nested calls to any of them create nested event loops; you don't really want those unless you're supremely careful. An event loop only terminates when it is not processing any active event callbacks, and if those event callbacks create inner event loops, the outer event loop will not get to do anything at all until the inner one has finished.
As you're taking control of the outer event loop (in a very inefficient way, but oh well) you really want the inner event loops to not run at all. There's three possible ways to deal with this; I suspect that the third (coroutines) will be most suitable for you and that the first is what you're really trying to avoid, but that's definitely your call.
1. Continuation Passing
You can rewrite the inner code into continuation-passing style — a big pile of procedures that hands off from step to step through a state machine/workflow — so that it doesn't actually call vwait (and friends). The only one of the family that tends to be vaguely safe is update idletasks (which is really just Tcl_DoOneEvent(TCL_IDLE_EVENTS | TCL_DONT_WAIT)) to process Tk internally-generated alterations.
This option was your main choice up to Tcl 8.5, and it was a lot of work.
2. Threads
You can move to a multi-threaded application. This can be easy… or very difficult; the details depend on an examination of what you're doing throughout the application.
If going this route, remember that Tcl interpreters and Tcl values are totally thread-bound; they internally use thread-specific data so that they can avoid big global locks. This means that threads in Tcl are comparatively expensive to set up, but actually use multiple CPUs very efficiently afterwards; thread pooling is a very common approach.
3. Coroutines
Starting in 8.6, you can put the inner code in a coroutine. Almost everything in 8.6 is coroutine-aware (“non-recursive” in our internal lingo) by default (including commands you wouldn't normally think of, such as source) and once you've done that, you can replace the vwait calls with equivalents from the Tcllib coroutine package and things will typically “just work”. (For example, vwait var becomes coroutine::vwait var, and after 123 becomes coroutine::after 123.)
The only things that don't have direct replacements are tkwait window and tkwait visibility; you'll need to simulate those with waiting for a <Destroy> or <Visibility> event (the latter is uncommon as it is unsupported on some platforms), which you do by binding a trivial callback on those that just sets a variable that you can coroutine::vwait on (which is essentially all that tkwait does internally anyway).
Coroutines can become messy in a few cases, such as when you've got C code that is not coroutine-aware. The main places in Tcl where these come into play are in trace callbacks, inter-interpreter calls, and the scripted implementations of channels; the issue there is that the internal APIs these sit behind are rather complicated already (especially channels) and nobody's felt up to wading in and enabling a non-recursive implementation.

Related

Using "saved" registers in the main function at RISC-V Assembly

Suppose the simple following main function written in RISC-V Assembly:
.globl main
main:
addi s3,zero,10 #Should this register (s3) be saved before using?
Since s3 is a "saved register", the procedure calling conventions should be followed and thus, this register should be pushed to the stack before using it. However, by looking at the source file, no other procedure has used this register and saving the register to the stack seems redundant.
My question is, should these types of registers be saved every time before every usage even if it means writing more (redundant) code just to obey the calling conventions? Can these conventions sometimes be ignored to improve performance?
In the example above, should the register be saved because it is unknown if the main's caller has been using the s3 register?
Yes, main is a function that has a real caller you return to, and that caller might be using s3 for something.
Unless your main never returns, either being an infinite loop or only exiting by calling exit or a system call. If you never return, you don't need to be able to restore the caller's state, or even find your way back (via a return address).
So if it's just as convenient to call exit instead of ever returning from main, doing that allows you to avoid saving anything.
This also applies in cases where there's nothing for main to return to, of course, so returning wasn't even an option. e.g. if it's the entry point in a kernel or other freestanding code.
Also, I hope you understand that saved every time before every usage means once per function that uses them, not separately around each separate block. And not saving call-clobbered registers around each function call; just let them die.
Can these conventions sometimes be ignored to improve performance?
Yes, if you keep the details invisible to any code you don't control.
If you treat small private helper functions as actually part of one big function, then they can use a "private" custom calling convention. (Even if you do actually call / return instead of just jumping to them, if you want to avoid inlining them at multiple callsites)
Sometimes this is just taking advantage of extra guarantees when you know about the function you're calling. e.g. that it doesn't actually clobber some of its input arg registers. This can be useful in recursion when you're calling yourself: foo(int *p, int a) self calls might take advantage of p still being in the same register unmodified, instead of having to keep p somewhere else for use after the call returns like it would if calling an "unknown" function where you can't assume anything the calling convention doesn't guarantee.
Or if you have a publicly-visible wrapper in front of your actual private recursive function, you can set up some some constants, or even have the recursive function treat one register as a static variable, instead of passing around pointers to some shared state in memory. (That's no longer pure recursion, just a loop that uses the asm stack to keep track of some history that happens to include a jump address.)

When is a command's compile function called?

My understanding of TCL execution is, if a command's compile function is defined, it is first called when it comes to execute the command before its execution function is called.
Take command append as example, here is its definition in tclBasic.c:
static CONST CmdInfo builtInCmds[] = {
{"append", (Tcl_CmdProc *) NULL, Tcl_AppendObjCmd,
TclCompileAppendCmd, 1},
Here is my testing script:
$ cat t.tcl
set l [list 1 2 3]
append l 4
I add gdb breakpoints at both functions, TclCompileAppendCmd and Tcl_AppendObjCmd. My expectation is TclCompileAppendCmd is hit before Tcl_AppendObjCmd.
Gdb's target is tclsh8.4 and argument is t.tcl.
What I see is interesting:
TclCompileAppendCmd does get hit first, but it is not from t.tcl,
rather it is from init.tcl.
TclCompileAppendCmd gets hit several times and they all are from init.tcl.
The first time t.tcl executes, it is Tcl_AppendObjCmd gets hit, not TclCompileAppendCmd.
I cannot make sense of it:
Why is the compile function called for init.tcl but not for t.tcl?
Each script should be independently compiled, i.e. the object with compiled command append at init.tcl is not reused for later scripts, isn't it?
[UPDATE]
Thanks Brad for the tip, after I move the script to a proc, I can see TclCompileAppendCmd is hit.
The compilation function (TclCompileAppendCmd in your example) is called by the bytecode compiler when it wants to issue bytecode for a particular instance of that particular command. The bytecode compiler also has a fallback if there is no compilation function for a command: it issues instructions to invoke the standard implementation (which would be Tcl_AppendObjCmd in this case; the NULL in the other field causes Tcl to generate a thunk in case someone really insists on using a particular API but you can ignore that). That's a useful behaviour, because it is how operations like I/O are handled; the overhead of calling a standard command implementation is pretty small by comparison with the overhead of doing disk or network I/O.
But when does the bytecode compiler run?
On one level, it runs whenever the rest of Tcl asks for it to be run. Simple! But that's not really helpful to you. More to the point, it runs whenever Tcl evaluates a script value in a Tcl_Obj that doesn't already have bytecode type (or if the saved bytecode indicates that it is for a different resolution context or different compilation epoch) except if the evaluation has asked to not be bytecode compiled by the flag TCL_EVAL_DIRECT to Tcl_EvalObjEx or Tcl_EvalEx (which is a convenient wrapper for Tcl_EvalObjEx). It's that flag which is causing you problems.
When is that flag used?
It's actually pretty simple: it's used when some code is believed to be going to be run only once because then the cost of compilation is larger than the cost of using the interpretation path. It's particularly used by Tk's bind command for running substituted script callbacks, but it is also used by source and the main code of tclsh (essentially anything using Tcl_FSEvalFileEx or its predecessors/wrappers Tcl_FSEvalFile and Tcl_EvalFile). I'm not 100% sure whether that's the right choice for a sourced context, but it is what happens now. However, there is a workaround that is (highly!) worthwhile if you're handling looping: you can put the code in a compiled context within that source using a procedure that you call immediately or use an apply (I recommend the latter these days). init.tcl uses these tricks, which is why you were seeing it compile things.
And no, we don't normally save compiled scripts between runs of Tcl. Our compiler is fast enough that that's not really worthwhile; the cost of verifying that the loaded compiled code is correct for the current interpreter is high enough that it's actually faster to recompile from the source code. Our current compiler is fast (I'm working on a slower one that generates enormously better code). There's a commercial tool suite from ActiveState (the Tcl Dev Kit) which includes an ahead-of-time compiler, but that's focused around shrouding code for the purposes of commercial deployment and not speed.

How many arguments are passed in a function call?

I wish to analyze assembly code that calls functions, and for each 'call' find out how many arguments are passed to the function. I assume that the target functions are not accessible to me, but only the calling code.
I limit myself to code that was compiled with GCC only, and to System V ABI calling convention.
I tried scanning back from each 'call' instruction, but I failed to find a good enough convention (e.g., where to stop scanning? what happen on two subsequent calls with the same arguments?). Assistance is highly appreciated.
Reposting my comments as an answer.
You can't reliably tell in optimized code. And even doing a good job most of the time probably requires human-level AI. e.g. did a function leave a value in RSI because it's a second argument, or was it just using RSI as a scratch register while computing a value for RDI (the first argument)? As Ross says, gcc-generated code for stack-args calling-conventions have more obvious patterns, but still nothing easy to detect.
It's also potentially hard to tell the difference between stores that spill locals to the stack vs. stores that store args to the stack (since gcc can and does use mov stores for stack-args sometimes: see -maccumulate-outgoing-args). One way to tell the difference is that locals will be reloaded later, but args are always assumed to be clobbered.
what happen on two subsequent calls with the same arguments?
Compilers always re-write args before making another call, because they assume that functions clobber their args (even on the stack). The ABI says that functions "own" their args. Compilers do make code that does this (see comments), but compiler-generated code isn't always willing to re-purpose the stack memory holding its args for storing completely different args in order to enable tail-call optimization. :( This is hand-wavey because I don't remember exactly what I've seen as far as missed tail-call optimization opportunities.
Yet if arguments are passed by the stack, then it shall probably be the easier case (and I conclude that all 6 registers are used as well).
Even that isn't reliable. The System V x86-64 ABI is not simple.
int foo(int, big_struct, int) would pass the two integer args in regs, but pass the big struct by value on the stack. FP args are also a major complication. You can't conclude that seeing stuff on the stack means that all 6 integer arg-passing slots are used.
The Windows x64 ABI is significantly different: For example, if the 2nd arg (after adding a hidden return-value pointer if needed) is integer/pointer, it always goes in RDX, regardless of whether the first arg went in RCX, XMM0, or on the stack. It also requires the caller to leave "shadow space".
So you might be able to come up with some heuristics to will work ok for un-optimized code. Even that will be hard to get right.
For optimized code generated by different compilers, I think it would be more work to implement anything even close to useful than you'd ever save by having it.

ActionScript, possible race condition on mouse event

I'm at a bit of a loss here. I have an issue that I think may be due to mouse events taking precedence. I have a function f being invoked on mouse clicks - f does some work, then invokes another function g. Is it even possible that f runs, then another click happens - invoking f again - and then g is executed?
If my phrasing is hard to understand, I'll try to show what I think may be happening:
click1 ----- /-----------\
\ / \
f -- f-- g g
/ \ /
click2 ------------ / \--------
|---------------- timeline----------------------|
I can say for certain the issue only arises (out of ~50 slow and ~50 quick double-clicks) when clicking twice in very quick succession (and not always even then). I realize my figure may confuse more than it clarifies, but I'm not sure how else to communicate my thoughts. Any input is greatly appreciated!
AS3 is a single threaded code execution environment which will execute all relevant code as it present itself. if a click triggers the execution of a chain of methods, all those methods will run before any other code can be executed again. As a result there cannot be a race condition in the code execution of AS3 code because of it single threaded nature.
All events in AS3 are not a special case in that regard, when their listener triggers all its code is executed the same way and no other code can be executed until it's done.
Special cases are:
You can suspend the execution by using timers and such so execution of code will happen at a later time. In that case there's no guaranty the triggering of those timers will be in sync with their starting order.
Executing asynchronous command (like loading something), in that case there's no guaranty loading operations will happen in order either.
But those special cases are not violating the execution of code principle in AS3, all code execute in one thread so their cannot be overlapping of any kind.

What is more expensive? A context switch or a function call?

On a bare metal system (embedded microcontroller, no MMU, no paging) what is more expensive? A full context switch (register save & restore) or a function call (activation record allocation)?
I understand that this is highly dependent on calling convention and hardware capability, but how would I go about evaluating this?
EDIT:
To provide more context, I'm trying to model two scheduling schemes. The first being a pre-emptive scheduler with context switching between tasks. The second being a function pointer run queue where tasks are state-machines broken into several enque-able function calls (where enqueing occurs on an IO event driven basis).
For the most part, I can gather good data on how long my tasks take (both IO and CPU time) but I need some help figuring out the additional overhead costs to add as constants in my model.
Since the system calls that trigger context switches are function calls, and the hardware interrupts that can trigger context-switches are similar, (and require a call to an event/semaphore, and a jump/call to the scheduler entry point, to signal the context-switch), I would say that a function call would be cheaper CPU-cycle wise unless an unreasonable number of parameters were passed.
This smells like an XY problem - why do you ask this? Context switches and function calls are almost orthogonal - one is a stack-based mechanism, the other selects a different stack entirely.
You'd go about evaluating this by contrasting the techniques and their actual affects on overall data movement.
For example, on a 6502, an interrupt pushes: The Program Counter, X, Y, A, and status register. That's 6 bytes of actual data, and takes 7 CPU cycles.
Granted, a 6502 is a much simpler CPU than modern designs, but sits as a fundamental example of the problem.
Now, a function call can arguably be as little as a Jump Subroutine, which simply pushes the current PC on to the stack, and then changes the PC to the new location. On the 6502, a JSR cost 6 cycles.
If you consider JSR and BRK (software interrupt on 6502) as the primitives, JSR is cheaper than BRK, by 1 cycle. This is outside the costs of standing up the call frame.
Most context switches are done automatically (via a timer or whatever) to simulate multi processing. But some systems use the CPUs trap primitive for system calls (like INT in MS-DOS, and TRAP in the old Mac OS). So a soft interrupt still has to stand up the stack frame, just like a normal subroutine would.
In the end, a JSR is likely cheaper than any of the higher level switch mechanisms, simply because it's so lightweight. The interrupt mechanisms usually have a indirection mechanic (which is why they're used in system call so much) that a subroutine does not have. The compiler managed the subroutine addresses at compile time.
But those are the considerations to look at to evaluate raw performance.