What does "cleanup" in NEXT_INST_F and NEXT_INST_V mean? - tcl

I am plowing TCL source code and get confused at macro NEXT_INST_F and NEXT_INST_V in tclExecute.c. Specifically the cleanup parameter of the macro.
Initially I thought cleanup means the net number of slots consumed/popped from the stack, e.g. when 3 objects are popped out and 1 object pushed in, cleanup is 2.
But I see INST_LOAD_STK has cleanup set to 1, shouldn't it be zero since one object is popped out and 1 object is pushed in?
I am lost reading the code of NEXT_INST_F and NEXT_INST_V, there are too many jumps.
Hope you can clarify the semantic of cleanup for me.

The NEXT_INST_F and NEXT_INST_V macros (in the implementation of Tcl's bytecode engine) clean up the state of the operand stack and push the result of the operation before going to the next instruction. The only practical difference between the two is that one is designed to be highly efficient when the number of stack locations to be cleaned up is a constant number (from a small range: 0, 1 and 2 — this is the overwhelming majority of cases), and the other is less efficient but can handle a variable number of locations to clean up or a number outside the small range. So NEXT_INST_F is basically an optimised version of NEXT_INST_V.
The place where macros are declared in tclExecute.c has this to say about them:
/*
* The new macro for ending an instruction; note that a reasonable C-optimiser
* will resolve all branches at compile time. (result) is always a constant;
* the macro NEXT_INST_F handles constant (nCleanup), NEXT_INST_V is resolved
* at runtime for variable (nCleanup).
*
* ARGUMENTS:
* pcAdjustment: how much to increment pc
* nCleanup: how many objects to remove from the stack
* resultHandling: 0 indicates no object should be pushed on the stack;
* otherwise, push objResultPtr. If (result < 0), objResultPtr already
* has the correct reference count.
*
* We use the new compile-time assertions to check that nCleanup is constant
* and within range.
*/
However, instructions can also directly manipulate the stack. This complicates things quite a lot. Most don't, but that's not the same as all. If you were to view this particular load of code as one enormous pile of special cases, you'd not be very wrong.
INST_LOAD_STK (a.k.a loadStk if you're reading disassembly of some Tcl code) is an operation that will pop an unparsed variable name from the stack and push the value read from the variable with that name. (Or an error will be thrown.) It is totally expected to pop one value and push another (from objResultPtr) since we are popping (and decrementing the reference count) of the variable name value, and pushing and incrementing the reference count of a different value that was read from the variable.
The code to read and write variables is among the most twisty in the bytecode engine. Far more goto than is good for your health.

Related

TFPCustomHashTable constructor use 196613 constant. Why use this particular value?

Following code is part of contnrs unit of FreePascal
constructor TFPCustomHashTable.Create;
begin
CreateWith(196613,#RSHash);
end;
I am curious about 196613. I know it is hash table size. Is there any particular reason why this value is used?
In my test, constructor took about 3-4 ms to execute, which in my particular situation is not accepted. I suspect this is related to this constant value.
Update:
My question are:
Why 196613 is chosen? Why not other prime number?
Does it affect constructor call execution time?
196613 is one of the numbers being recommended for hash tables sizes (prime and far away from powers of two), for more info see e.g. https://planetmath.org/goodhashtableprimes.
It affects the constructor call execution time, yes. You can always construct TFPCustomHashTable using CreateWith and pass a size of your choice (any number is fine, as resizing algorithm checks for suitable sizes anyways) as well as your own hash function (or the pre-defined RSHash):
MyHashTable:=TFPCustomHashTable.CreateWith(193,#RSHash);
Keep in mind though, that resizing a hash table is an expensive operation as it requires to recalculate the hash function for all elements, so starting with a too small value is neither a good idea.

Octave force deepcopy

The question
What are the ways of coercing octave to create a real copy of whatever object? Structures are the main interest.
My underlying problem
In my problem I'm obtaining a rather large structure from another function in a loop but for the current task only a few pieces of it are needed. For example:
for i=1:many
res=solver(params);
store1{i}=res.string1;
store2{i}=res.arr(:,1);
end
res is a sizable chunk of data and due to lazy-copy those store-s are references to tiny portions of bytes in that chunk. After I store those tiny portions, I don't need res itself, however, since middle of that chunk is referenced by store, the memory area is unfit for res obtained on the next iteration (they are of the same size) and thus another sizable piece of memory is allocated, which is then again crossed by few tiny links an so on.
Without storing parts of res, the program successfully keeps the memory consumption same after first couple of iterations.
So how do I make a complete copy of structure field?
I've tried using struct-related functions like rmfield but those keep references instead of their own objects.
I've tried to wrap the assignment of in its own function:
new_struct=copy( rmfield(old_struct,"bigdata"));
function c=copy(a);
c=a;
end;
This by the way doesn't work even for arrays.
I'm interested in method applicable to any generic variable.
Minimal working example of the problem
a=cell(3,1);
for i=1:length(a);
r=rand(100000,1000);
a{i}=r(1:100,end);
whos; fflush(stdout);
pause(2);
end;
The above code will cause memory usage to gradually grow by far more than 8.08 kb reported by whos due to references stored by a{i} blocking bigger memory block than they actually need. If you force the proper copy, the problem is not present.
Numerical arrays
For numeric types addition of zero is enough to warrant a new array.
c=a+0;
Strings
For string which is 1 x n char array, something along the following lines will work:
c=[a "a"](1:end-1);
Multidimensional char arrays will require concatenation with a column:
c=[a true(size(a,1),1)](:,1:end-1);
Here true is used to generate dummy array of size compatible with char. (There seems to be no procedural method of generating char array of arbitrary size) char(zeros(size(a,1),1)) and char(true(size(a,1),1)) caused excess memory usage during their creation on some calls.
Note that empty concatenation c=[a ""]; will not result in a copying. Also it is possible to do c=[a+0 ""]; which will result in a copying due to +0 but that one infers type conversions to and from double which is 8 times larger in size. (char(zeros( doesn't seem to cause that)
Other types
In general you can use casting for the types allowed by it in order to not tailor the expressions manually as I had to do above:
typelist={"double","single","char"}; %full list of supported types is available in the link
class_of_a = typelist{ isa(a,typelist) };
c=typecast( [typecast(a,'single'); single(1)] (1:end-1), class_of_a);
Single is seemingly smallest datatype available in octave.
Note that logical is not supported by this method.
Copying structures
Apparently you'd have to write your own function to go around struct fields, copy them with above methods and recursively go to substructs.
(As it doesn't involve complexities relevant here, I'd rather leave that to be done by those who actually needs that, my own problem being solved by +0's.)

PIC Assembly: Calling functions with variables

So say I have a variable, which holds a song number. -> song_no
Depending upon the value of this variable, I wish to call a function.
Say I have many different functions:
Fcn1
....
Fcn2
....
Fcn3
So for example,
If song_no = 1, call Fcn1
If song_no = 2, call Fcn2
and so forth...
How would I do this?
you should have compare function in the instruction set (the post suggests you are looking for assembly solution), the result for that is usually set a True bit or set a value in a register. But you need to check the instruction set for that.
the code should look something like:
load(song_no, $R1)
cmpeq($1,R1) //result is in R3
jmpe Fcn1 //jump if equal
cmpeq ($2,R1)
jmpe Fcn2
....
Hope this helps
I'm not well acquainted with the pic, but these sort of things are usually implemented as a jump table. In short, put pointers to the target routines in an array and call/jump to the entry indexed by your song_no. You just need to calculate the address into the array somehow, so it is very efficient. No compares necessary.
To elaborate on Jens' reply the traditional way of doing on 12/14-bit PICs is the same way you would look up constant data from ROM, except instead of returning an number with RETLW you jump forward to the desired routine with GOTO. The actual jump into the jump table is performed by adding the offset to the program counter.
Something along these lines:
movlw high(table)
movwf PCLATH
movf song_no,w
addlw table
btfsc STATUS,C
incf PCLATH
addwf PCL
table:
goto fcn1
goto fcn2
goto fcn3
.
.
.
Unfortunately there are some subtleties here.
The PIC16 only has an eight-bit accumulator while the address space to jump into is 11-bits. Therefore both a directly writable low-byte (PCL) as well as a latched high-byte PCLATH register is available. The value in the latch is applied as MSB once the jump is taken.
The jump table may cross a page, hence the manual carry into PCLATH. Omit the BTFSC/INCF if you know the table will always stay within a 256-instruction page.
The ADDWF instruction will already have been read and be pointing at table when PCL is to be added to. Therefore a 0 offset jumps to the first table entry.
Unlike the PIC18 each GOTO instruction fits in a single 14-bit instruction word and PCL addresses instructions not bytes, so the offset should not be multiplied by two.
All things considered you're probably better off searching for general PIC16 tutorials. Any of these will clearly explain how data/jump tables work, not to mention begin with the basics of how to handle the chip. Frankly it is a particularly convoluted architecture and I would advice staying with the "free" hi-tech C compiler unless you particularly enjoy logic puzzles or desperately need the performance.

How do interpreters load their values?

I mean, interpreters work on a list of instructions, which seem to be composed more or less by sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers, by doing bit-wise operations, for use in a big switch statement where all operations are located.
My specific question is: How do the object values get stored/retrieved?
For example, let's (non-realistically) assume:
Our instructions are unsigned 32 bit integers.
We've reserved the first 4 bits of the integer for opcodes.
If I wanted to store data in the same integer as my opcode, I'm limited to a 24 bit integer. If I wanted to store it in the next instruction, I'm limited to a 32 bit value.
Values like Strings require lots more storage than this. How do most interpreters get away with this in an efficient manner?
I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is a fair amount different.
For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.
So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.
For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)
That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of conditional jumps being special instructions, we might want to have just one jump instruction, and a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes:
Optional: small immediate (N bits of data in the instruction itself)
large immediate (data in the 64-bit word following the instruction)
implied (operand(s) on top of stack)
Absolute (address specified in 64 bits following instruction)
relative (offset specified in or following instruction)
I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.
Anyway, in a model like this, an object's value is just some locations in memory. Likewise, a string is just some sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:
class A {
int x;
int y;
};
class B {
int a;
int b;
};
...and some code like:
A a {1, 2};
B b {3, 4};
a.x += b.a;
The initialization would mean values in the executable file loaded into the memory locations assigned to a and b. The addition could then produce code something like this:
push immediate a.x // put &a.x on top of stack
dupe // copy address to next lower stack position
load // load value from a.x
push immediate b.a // put &b.a on top of stack
load // load value from b.a
add // add two values
store // store back to a.x using address placed on stack with `dupe`
Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).
The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.
Either way, most of object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).
Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)
Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.
Here's an example:
static void string(Compiler* compiler, int allowAssignment)
{
// Define a constant for the literal.
int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
compiler->parser->currentString, compiler->parser->currentStringLength));
// Compile the code to load the constant.
emit(compiler, CODE_CONSTANT);
emit(compiler, constant);
}
At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.
Here's an example:
CASE_CODE(CONSTANT):
PUSH(frame->fn->constants[READ_ARG()]);
DISPATCH();
For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.

How to tell whether Haskell will cache a result or recompute it?

I noticed that sometimes Haskell pure functions are somehow cached: if I call the function twice with the same parameters, the second time the result is computed in no time.
Why does this happen? Is it a GHCI feature or what?
Can I rely on this (ie: can I deterministically know if a function value will be cached)?
Can I force or disable this feature for some function calls?
As required by comments, here is an example I found on the web:
isPrime a = isPrimeHelper a primes
isPrimeHelper a (p:ps)
| p*p > a = True
| a `mod` p == 0 = False
| otherwise = isPrimeHelper a ps
primes = 2 : filter isPrime [3,5..]
I was expecting, before running it, to be quite slow, since it keeps accessing elements of primes without explicitly caching them (thus, unless these values are cached somewhere, they would need to be recomputed plenty times). But I was wrong.
If I set +s in GHCI (to print timing/memory stats after each evaluation) and evaluate the expression primes!!10000 twice, this is what I get:
*Main> :set +s
*Main> primes!!10000
104743
(2.10 secs, 169800904 bytes)
*Main> primes!!10000
104743
(0.00 secs, 0 bytes)
This means that at least primes !! 10000 (or better: the whole primes list, since also primes!!9999 will take no time) must be cached.
primes, in your code, is not a function, but a constant, in haskellspeak known as a CAF. If it took a parameter (say, ()), you would get two different versions of the same list back if calling it twice, but as it is a CAF, you get the exact same list back both times;
As a ghci top-level definition, primes never becomes unreachable, thus the head of the list it points to (and thus its tail/the rest of the computation) is never garbage collected. Adding a parameter prevents retaining that reference, the list would then be garbage collected as (!!) iterates down it to find the right element, and your second call to (!!) would force repetition of the whole computation instead of just traversing the already-computed list.
Note that in compiled programs, there is no top-level scope like in ghci and things get garbage collected when the last reference to them is gone, quite likely before the whole program exits, CAF or not, meaning that your first call would take long, the second one not, and after that, "the future of your program" not referencing the CAF anymore, the memory the CAF takes up is recycled.
The primes package provides a function that takes an argument for (primarily, I'd claim) this very reason, as carrying around half a terabyte of prime numbers might not be what one wants to do.
If you want to really get down to the bottom of this, I recommend reading the STG paper. It doesn't include newer developments in GHC, but does a great job of explaining how Haskell maps onto assembly, and thus how thunks get eaten by strictness, in general.