Tcl upvar performance improvement vs. direct pass - tcl

This pertains to Tcl 8.5
Say I have a very large dictionary.
From a performance point of view (memory footprint, etc.), assuming that I do not modify the dictionary, would upvar provide a massive improvement in terms of memory? I am using an EDA tool that has a Tcl shell, but the vendor disabled the Tcl memory command. I know that Tcl can share strings under the hood for performance... The same dictionary can be passed through several nested proc calls.
Thanks.

As long as you don't modify the dictionary, upvar won't provide much of a noticeable difference in performance or memory consumption.
Tcl passes values by immutable reference, and copies them only when you write to them while they're shared, e.g., between a global variable and a local variable (procedure formal parameters are local variables). If you never change anything, you'll be using a shared reference and everything will be quick. If you do need to change something, you should use upvar or global (or one of the more exotic variants) to make a local variable that aliases the caller's/global variable and change it via that alias, as that's fastest. But this only matters if you're going to change the value.

I would imagine that under the hood the dictionary isn't copied until it's written to, so if there are no writes you should be okay. Use global if you want to be absolutely sure:
proc myproc {} {
    global mydictionary
    # reads of $mydictionary go through the shared value; no copy is made
}

Related

MIPS functions and variables in stack

I have come into contact with MIPS-32, and I have a question: can a variable, for example $t0, holding a value in one function be altered by another function, and what does this have to do with the stack, i.e., the location of the variable in memory? Everything I am talking about is in assembly language. I would also like some examples of this: a function altering (or not) a variable of another function, and how that variable "survives" (or not), in terms of whether it is passed as a copy or as a reference.
(I hope we can create an environment where conceptual questions like the one above can be explored more.)
$t0 declared having the value in one function can be altered by another
$t0 is known as a call-clobbered register.  It is no different from the other registers as far as the hardware is concerned — being call-clobbered vs. call-preserved is an aspect of software convention, called the calling convention, which is a subset of an Application Binary Interface (ABI).
The calling convention, when followed, allows a function, F, to call another function, G, knowing only G's signature — name, parameters and their types, and return type.  F would not have to be changed when G changes, as long as both follow the convention.
Call clobbered doesn't mean it has to be clobbered, though, and when writing your own code you can use it any way you like (unless your coursework says to follow the MIPS32 calling convention, of course).
By the convention, a call-clobbered register can be used without worry: all you have to do to use it is put a value into it!
Call-preserved registers can also be used, if desired, but they should be presumed to be already in use by some caller (maybe not the immediate caller, but some distant caller), so the values they contain must be restored before returning to the caller.  This is, of course, only possible by saving the original value before repurposing the register for a new use.
The two sets of registers (call-clobbered/call-preserved) serve two common use cases, namely cheap temporary storage and long-term variables.  The former requires no effort to preserve/restore, while the latter does require this effort but gives us registers that survive a function call — useful, for example, when a loop contains a function call.
The stack comes into play when we need to first preserve, then restore, call-preserved registers.  If we want to use a call-preserved register, we must save its original value in order to restore it later, and the most reasonable place to save it is the stack, so we allocate some space from the stack.
To allocate some local memory, the stack pointer is decremented to reserve some space for the function.  Since the stack pointer must have its original value upon return to the caller, this space is necessarily deallocated upon return.  Hence the stack is great for local storage, and since the original values of preserved registers must likewise be restored before returning, that local storage is the appropriate place for them.
https://www.dyncall.org/docs/manual/manualse11.html — search for section "MIPS32".
Let's also make the distinction between variables, a logical concept, and storage, a physical concept.
In high level language, variables are named and have scopes (limited lifetimes).  In machine code, we have physical hardware (storage) resources of registers and memory; these simply exist: they have no concept of lifetime.  In and of themselves these hardware resources are not variables, but places that we can use to hold variables for their lifetime/scope.
As assembly language programmers, we keep a mental (or even written) map of our logical variables to physical resources.  The compiler does the same, knowing the scope/lifetime of program variables and creating that "mental" map of variables to machine code storage.  Variables that have overlapping lifetimes cannot share the same hardware resource, of course, but when a variable is out of scope, its (mapped-to) physical resource can be reused for another purpose.
Logical variables can also move around to different physical resources.  A logical variable that is a parameter, may be passed in a CPU register, e.g. $a0, but then be moved into an $s register or into a (stack) memory location.  Such is the business of machine code.
To allocate some hardware storage to a high level language (or pseudo code) variable, we simply initialize the storage!  Hardware resources are necessarily constantly being repurposed to hold a different logical variable.
See also:
How a recursive function works in MIPS? — for discussion on variable analysis.
Mips/assembly language exponentiation recursivley
What's the difference between caller-saved and callee-saved in RISC-V

How many arguments are passed in a function call?

I wish to analyze assembly code that calls functions, and for each 'call' find out how many arguments are passed to the function. I assume that the target functions are not accessible to me, but only the calling code.
I limit myself to code that was compiled with GCC only, and to the System V ABI calling convention.
I tried scanning backward from each 'call' instruction, but I failed to find a good enough convention (e.g., where to stop scanning? what happens on two subsequent calls with the same arguments?). Assistance is highly appreciated.
Reposting my comments as an answer.
You can't reliably tell in optimized code. And even doing a good job most of the time probably requires human-level AI. e.g. did a function leave a value in RSI because it's a second argument, or was it just using RSI as a scratch register while computing a value for RDI (the first argument)? As Ross says, gcc-generated code for stack-args calling conventions has more obvious patterns, but still nothing easy to detect.
It's also potentially hard to tell the difference between stores that spill locals to the stack vs. stores that store args to the stack (since gcc can and does use mov stores for stack-args sometimes: see -maccumulate-outgoing-args). One way to tell the difference is that locals will be reloaded later, but args are always assumed to be clobbered.
What happens on two subsequent calls with the same arguments?
Compilers re-write args before making another call, because they assume that functions clobber their args (even on the stack): the ABI says that functions "own" their args. That said, compiler-generated code isn't always willing to re-purpose the stack memory holding its args for storing completely different args in order to enable tail-call optimization. :( This is hand-wavy because I don't remember exactly what I've seen as far as missed tail-call-optimization opportunities.
Yet if arguments are passed on the stack, then that is probably the easier case (and I conclude that all 6 registers are used as well).
Even that isn't reliable. The System V x86-64 ABI is not simple.
int foo(int, big_struct, int) would pass the two integer args in regs, but pass the big struct by value on the stack. FP args are also a major complication. You can't conclude that seeing stuff on the stack means that all 6 integer arg-passing slots are used.
The Windows x64 ABI is significantly different: For example, if the 2nd arg (after adding a hidden return-value pointer if needed) is integer/pointer, it always goes in RDX, regardless of whether the first arg went in RCX, XMM0, or on the stack. It also requires the caller to leave "shadow space".
So you might be able to come up with some heuristics that will work OK for un-optimized code. Even that will be hard to get right.
For optimized code generated by different compilers, I think it would be more work to implement anything even close to useful than you'd ever save by having it.

Does Tcl eval command prevent byte coding?

I know that in some dynamic, interpreted languages, using eval can slow things down, as it prevents byte-compilation. Is this so in Tcl 8.5?
Thanks
It doesn't prevent bytecode compilation, but it can slow things down anyway. The key issue is that it can prevent the bytecode compiler from having access to the local variable table (LVT) during compilation, forcing variable accesses to go via a hash lookup. Tcl's got an ultra-fast hash algorithm (we've benchmarked it a lot and tried a lot of alternatives; it's very hot code) but the LVT has it beat as that's just a simple C array lookup when the bytes hit the road. The LVT is only known properly when compiling a whole procedure (or other procedure-like thing, such as a lambda term or TclOO method).
Now, I have tried making this specific case:
eval {
    # Do stuff in here...
}
be fully bytecode-compiled and it mostly works (apart from a few weird things that are currently observable but perhaps shouldn't be) yet for the amount that we use that, it's just plain not worth it. In any other case, the fact that the script can't be known too precisely at the point where the compiler is running forces the LVT-less operation mode.
On the other hand, it's not all doom and gloom. Provided the actual script being run inside the eval doesn't change (and that includes not being regenerated through internal concat — multi-argument eval never gets this benefit) Tcl can cache the compilation of the code in the internal representation of the script value, LVT-less though it is, and so there's still quite a bit of performance gain there. This means that this isn't too bad, performance wise:
set script {
    foo bar $boo
}
for {set i 0} {$i < 10} {incr i} {
    eval $script
}
If you have real performance-sensitive code, write it without eval. Expansion syntax — {*} — can help here, as can helper procedures. Or write the critical bits in C or C++ or Fortran or … (see the critcl and ffidl extension packages for details of cool ways to do this, or just load the DLL as needed if it has a suitable *_Init function).

cudaMemcpy() vs cudaMemcpyFromSymbol()

I'm trying to figure out why cudaMemcpyFromSymbol() exists. It seems everything the 'symbol' function can do, the non-symbol commands can do.
The symbol function appears to make it easy for part of an array or an index to be moved, but this could just as easily be done with the non-symbol function. I suspect the non-symbol approach will run faster, as no symbol lookup is needed. (It is not clear whether the symbol lookup is done at compile time or run time.)
Why would I use cudaMemcpyFromSymbol() vs cudaMemcpy()?
cudaMemcpyFromSymbol is the canonical way to copy from any statically defined variable in device memory.
cudaMemcpy can't be used directly to copy to or from a statically defined device variable because it requires a device pointer, and that isn't known to host code at runtime. Therefore, an API call which can interrogate the device context symbol table is required. The two choices are either cudaMemcpyFromSymbol, which does the symbol lookup and copy in one operation, or cudaGetSymbolAddress, which returns an address that can be passed to cudaMemcpy. The former is probably more efficient if you only want to do one copy, the latter if you want to use the address multiple times in host code.
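A host-side sketch of both approaches, assuming the CUDA toolkit is available to compile it (devCounter is a hypothetical device variable, and error checking is omitted for brevity):

```c
#include <cuda_runtime.h>
#include <stdio.h>

__device__ int devCounter;          /* statically defined device variable */

int main(void) {
    int host = 0;

    /* One-shot: symbol lookup + copy in a single call. */
    cudaMemcpyFromSymbol(&host, devCounter, sizeof(host), 0,
                         cudaMemcpyDeviceToHost);

    /* Reusable: resolve the device address once, then use plain cudaMemcpy
     * as many times as needed. */
    void *devPtr = NULL;
    cudaGetSymbolAddress(&devPtr, devCounter);
    cudaMemcpy(&host, devPtr, sizeof(host), cudaMemcpyDeviceToHost);

    printf("%d\n", host);
    return 0;
}
```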

Performance of Pure TCL vs TCL C API's for populating a TCL array

Will reading a file using the Tcl C API and populating a Tcl array be much faster than doing the same in standard Tcl? I have a large file, about 100+ MB, which I need to read and use to set some hash entries. Using the Tcl C API only seems to provide at most a 2-to-4-times speed advantage. Is this usual, or am I missing something?
You're unlikely to get much of a performance gain in this case, as when you're setting array entries from the C API, you're bearing much of the cost that you'd experience if you just wrote the code as Tcl inside a procedure. In particular, you could very easily get worse performance through using an inefficient sub-API; some of Tcl's API functions are not very fast (e.g., Tcl_SetVar) but they're kept because of the large amount of existing code that uses them (and the fact that the faster functions require more C code to use). Bear in mind that setting an array element requires a mandatory hash table lookup, and those have a real cost (despite the fact that Tcl uses a very fast — if rather stupid — hash).
What's more, you can get better performance by using a Tcl list or dictionary (depending on what exactly you want to store) and the C API to those is quite quick (especially for lists, which are really C arrays of Tcl_Obj references). What I don't know is whether doing that would be a suitable substitute for your purposes.
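For reference, a sketch of what the list route looks like from C, using real Tcl 8.x API calls (Tcl_NewListObj, Tcl_ListObjAppendElement, Tcl_ObjSetVar2); the function name fill_list, the variable name lines, and the placeholder data are invented for illustration:

```c
#include <tcl.h>
#include <stdio.h>

/* Build a Tcl list in C and store it in the variable "lines".
 * Unlike array writes, appending to a list needs no per-element
 * hash-table lookup; the list is internally a C array of Tcl_Obj*. */
static int fill_list(Tcl_Interp *interp) {
    Tcl_Obj *list = Tcl_NewListObj(0, NULL);
    for (int i = 0; i < 3; i++) {
        char buf[32];
        snprintf(buf, sizeof buf, "line %d", i);
        if (Tcl_ListObjAppendElement(interp, list,
                Tcl_NewStringObj(buf, -1)) != TCL_OK) {
            return TCL_ERROR;
        }
    }
    /* Roughly the equivalent of: set lines {...} */
    if (Tcl_ObjSetVar2(interp, Tcl_NewStringObj("lines", -1), NULL,
            list, TCL_LEAVE_ERR_MSG) == NULL) {
        return TCL_ERROR;
    }
    return TCL_OK;
}
```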
The C API is there primarily to allow you to write Tcl extensions and just exposes the routines that 'pure Tcl' itself is written in. In a case like you describe I wouldn't expect to see much performance difference and remember:
"Premature optimization is the root of all evil (or at least most of it) in programming."
— Donald Knuth, Computer Programming as an Art (1974)
If you really need to load lots of data, maybe some extension like NAP (http://wiki.tcl.tk/4015) or similar would be appropriate?