Filling the gap in my understanding - function

I was going through the following series of lecture notes on operating systems:
http://williamstallings.com/Extras/OS-Notes/h3.html
Here, while explaining the different outcomes the threaded program can produce, the notes break down the execution of the function and say the following:
"sum first reads the value of a into a register. It then increments the register, then stores the contents of the register back into a. It then reads the values of of the control string, p and a into the registers that it uses to pass arguments to the printf routine. It then calls printf, which prints out the data"
I don't know exactly how a function is executed at the level of registers, and I also don't know which topic I should study to learn more about it.
So, which topic covers the execution of a function at the level of registers and at the level of electronic circuits?
Please also explain how the stack is incremented while a value is being read during the execution of a function.
Thanks in advance.

The advice to look at the assembler code is already a good one. You can look up the assembler instructions and think about what happens if, at any instruction, execution switches to the other thread.
Look at this code
la a, %r0
ld [%r0],%r1
add %r1,1,%r1
st %r1,[%r0]
ld [%r0], %o3 ! parameters are passed starting with %o0
mov %o0, %o1
la .L17, %o0
call printf
In the first four lines (the a++), execution can proceed in several different ways, because a thread switch can happen between any two instructions. You also don't know whether sum(1) or sum(0) is called first.
To understand what is going on at a deeper level, I suggest you look up 'computer organization'. See, for example, the Computer Organisation WikiBook.
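The effect of a thread switch in the middle of a++ can be sketched in plain C. This is only a single-threaded model with hand-written interleavings, not real threading: the Thread struct, the load/incr/store helpers and the two run_ functions are names made up for this illustration, and each helper stands in for one of the ld/add/st instructions above.

```c
#include <assert.h>

/* Hypothetical model: each thread has one private "register". */
typedef struct { int reg; } Thread;

static int a;  /* the shared variable from the lecture notes */

static void load(Thread *t)  { t->reg = a; }   /* ld  [a], reg */
static void incr(Thread *t)  { t->reg += 1; }  /* add reg, 1   */
static void store(Thread *t) { a = t->reg; }   /* st  reg, [a] */

/* Thread 0 completes a++ before thread 1 starts. */
int run_sequential(void)
{
    Thread t0 = {0}, t1 = {0};
    a = 0;
    load(&t0); incr(&t0); store(&t0);
    load(&t1); incr(&t1); store(&t1);
    return a;
}

/* Thread 1 runs entirely between thread 0's load and store,
 * so thread 0 later overwrites thread 1's update. */
int run_interleaved(void)
{
    Thread t0 = {0}, t1 = {0};
    a = 0;
    load(&t0);
    load(&t1); incr(&t1); store(&t1);
    incr(&t0); store(&t0);
    return a;
}

/* Small self-check showing the two outcomes. */
void check_interleavings(void)
{
    assert(run_sequential() == 2);
    assert(run_interleaved() == 1);
}
```

With a starting at 0, run_sequential() ends with a equal to 2, while run_interleaved() ends with a equal to 1 because one increment is lost.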

Cuda _sync functions, how to handle unknown thread mask?

This question is about adapting to the change in semantics from lock step to independent program counters. Essentially, what can I change calls like int __all(int predicate); into for Volta?
For example, int __all_sync(unsigned mask, int predicate);
with semantics:
Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
The docs assume that the caller knows which threads are active and can therefore populate mask accurately.
a mask must be passed that specifies the threads participating in the call
I don't know which threads are active. This is in a function that is inlined into various places in user code. That makes one of the following attractive:
__all_sync(UINT32_MAX, predicate);
__all_sync(__activemask(), predicate);
The first is analogous to a case declared illegal at https://forums.developer.nvidia.com/t/what-does-mask-mean-in-warp-shuffle-functions-shfl-sync/67697, quoting from there:
For example, this is illegal (will result in undefined behavior for warp 0):
if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);
The second choice, this time quoting from "__activemask() vs __ballot_sync()":
The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.
The operating semantics appear to be:
There is a warp of N threads
M (M <= N) threads are enabled by compile time control flow
D (D subset of M) threads are converged, as a runtime property
__activemask returns which threads happen to be converged
That suggests synchronising threads then using activemask,
__syncwarp();
__all_sync(__activemask(), predicate);
An NVIDIA blog post says that is also undefined: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
Calling the new __syncwarp() primitive at line 10 before __ballot(), as illustrated in Listing 11, does not fix the problem either. This is again implicit warp-synchronous programming. It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.
That marks the end of my ideas. That same blog concludes with some guidance on choosing a value for mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
However, I can't compute what the mask should be. It depends on the control flow at the call site that the code containing __all_sync was inlined into, which I don't know. I don't want to change every function to take an unsigned mask parameter.
How do I retrieve semantically correct behaviour without that global transform?
TL;DR: In summary, the correct programming approach will most likely be to do the thing you stated you don't want to do.
Longer:
This blog specifically suggests an opportunistic method for handling an unknown thread mask: precede the desired operation with __activemask() and use that for the desired operation. To wit (excerpting verbatim from the blog):
int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
That should be perfectly legal.
You might ask "what about item 2 mentioned at the end of the blog?" I think if you read that carefully, taking into account the usage I just excerpted, it's suggesting "don't just use __activemask()" if you intend something different. That reading seems evident from the full text there. It doesn't abrogate the legality of the previous construct.
You might ask "what about incidental or enforced divergence along the way?" (i.e. during the processing of my function, which is called from elsewhere)
I think you have only 2 options:
Grab the value of __activemask() at entry to the function and use it later when you call the sync operation you desire. That is your best guess as to the intent of the calling environment. CUDA doesn't guarantee that this will be correct; however, it should certainly be legal if you don't have enforced divergence at the point of your sync function call.
Make the intent of the calling environment clear - add a mask parameter to your function and rewrite the code everywhere (which you've stated you don't want to do).
There is no way to deduce the intent of the calling environment from within your function, if you permit the possibility of warp divergence prior to entry to your function, which obscures the calling environment intent. To be clear, CUDA with the Volta execution model permits the possibility of warp divergence at any time. Therefore, the correct approach is to rewrite the code to make the intent at the call site explicit, rather than trying to deduce it from within the called function.
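Why a sampled mask is only a best guess can be shown with a small host-side model. This is plain C, not CUDA device code: model_all_sync is a hypothetical stand-in that only mimics the quoted "non-zero if and only if predicate evaluates to non-zero for all non-exited threads in mask" rule, with bit i of each word standing for lane i of a warp. It does not model hangs or undefined behaviour when a named thread never arrives.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model only (an assumption for illustration).
 * mask:      lanes named as participants
 * exited:    lanes that have already exited
 * predicate: per-lane predicate values, one bit per lane */
int model_all_sync(uint32_t mask, uint32_t exited, uint32_t predicate)
{
    uint32_t participants = mask & ~exited;  /* non-exited threads in mask */
    return (predicate & participants) == participants;
}

/* Hypothetical scenario: the caller intends lanes 0-7 to vote, the
 * predicate is false on lane 5 only, and lane 5 happens to be diverged
 * at the moment the mask is sampled. */
void check_mask_choice(void)
{
    uint32_t intended  = 0x000000FFu;
    uint32_t predicate = intended & ~(1u << 5);
    uint32_t sampled   = intended & ~(1u << 5);  /* what a sampled mask might report */

    assert(model_all_sync(intended, 0u, predicate) == 0);  /* answer the caller wanted */
    assert(model_all_sync(sampled,  0u, predicate) == 1);  /* answer from the guessed mask */
}
```

In this scenario the vote over the intended mask is 0, but the vote over the sampled mask is 1, because the one lane with a false predicate was left out. Passing the mask in from the call site removes that ambiguity.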

What does "cleanup" in NEXT_INST_F and NEXT_INST_V mean?

I am plowing through the Tcl source code and am confused by the macros NEXT_INST_F and NEXT_INST_V in tclExecute.c, specifically by their cleanup parameter.
Initially I thought cleanup meant the net number of slots consumed/popped from the stack, e.g. when 3 objects are popped and 1 object is pushed, cleanup is 2.
But I see that INST_LOAD_STK has cleanup set to 1. Shouldn't it be zero, since one object is popped and one object is pushed?
I get lost reading the code of NEXT_INST_F and NEXT_INST_V; there are too many jumps.
I hope you can clarify the semantics of cleanup for me.
The NEXT_INST_F and NEXT_INST_V macros (in the implementation of Tcl's bytecode engine) clean up the state of the operand stack and push the result of the operation before going to the next instruction. The only practical difference between the two is that one is designed to be highly efficient when the number of stack locations to be cleaned up is a constant number (from a small range: 0, 1 and 2 — this is the overwhelming majority of cases), and the other is less efficient but can handle a variable number of locations to clean up or a number outside the small range. So NEXT_INST_F is basically an optimised version of NEXT_INST_V.
The place where macros are declared in tclExecute.c has this to say about them:
/*
* The new macro for ending an instruction; note that a reasonable C-optimiser
* will resolve all branches at compile time. (result) is always a constant;
* the macro NEXT_INST_F handles constant (nCleanup), NEXT_INST_V is resolved
* at runtime for variable (nCleanup).
*
* ARGUMENTS:
* pcAdjustment: how much to increment pc
* nCleanup: how many objects to remove from the stack
* resultHandling: 0 indicates no object should be pushed on the stack;
* otherwise, push objResultPtr. If (result < 0), objResultPtr already
* has the correct reference count.
*
* We use the new compile-time assertions to check that nCleanup is constant
* and within range.
*/
However, instructions can also directly manipulate the stack. This complicates things quite a lot. Most don't, but that's not the same as all. If you were to view this particular load of code as one enormous pile of special cases, you'd not be very wrong.
INST_LOAD_STK (a.k.a. loadStk if you're reading a disassembly of some Tcl code) is an operation that pops an unparsed variable name from the stack and pushes the value read from the variable with that name. (Or an error is thrown.) It is entirely expected to pop one value and push another (from objResultPtr), since we are popping the variable name value (and decrementing its reference count), and pushing a different value that was read from the variable (and incrementing its reference count).
The code to read and write variables is among the most twisty in the bytecode engine. Far more goto than is good for your health.
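The meaning of nCleanup described in the quoted comment can be modelled in a few lines of C. This is a simplified sketch, not the real macros: OperandStack, push and next_inst are hypothetical names, the stack holds plain ints instead of Tcl_Obj pointers, and reference counting and the pc adjustment are left out.

```c
#include <assert.h>

#define STACK_MAX 16

typedef struct {
    int slots[STACK_MAX];
    int top;               /* number of occupied slots */
} OperandStack;

void push(OperandStack *s, int value)
{
    assert(s->top < STACK_MAX);
    s->slots[s->top++] = value;
}

/* Hypothetical stand-in for the end-of-instruction step:
 * remove nCleanup operands, then push the result unless
 * resultHandling is 0. */
void next_inst(OperandStack *s, int nCleanup, int resultHandling, int result)
{
    assert(nCleanup >= 0 && nCleanup <= s->top);
    s->top -= nCleanup;                 /* "how many objects to remove" */
    if (resultHandling != 0) {
        assert(s->top < STACK_MAX);
        s->slots[s->top++] = result;    /* push the result afterwards */
    }
}
```

For a loadStk-like instruction, the stack holds one operand (the variable name); calling next_inst with nCleanup 1 and a non-zero resultHandling removes that operand and pushes the value, so the depth is unchanged even though the cleanup count is 1. In this model the cleanup count is the number of operands removed, not the net change in depth.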

MIPS: legal to have two consecutive "load word" instructions into the same register?

Background: We're seeing a very intermittent crash in a function foo(int *p). The crash occurs while dereferencing p, whose value in these cases turns out to be 0xffffffff. An analysis of the core dump shows that foo() is called from the following assembly snippet:
bne ... somewhere else
lw $a0,44(sp)
lw $a0,40(sp)
jal foo()
lui s1, 0x1000
Inspecting memory in the core dump shows that 44(sp) is 0xffffffff, whereas 40(sp) is the correct value we intend to dereference. However, the value of a0 at the time of the crash, inside foo(), is 0xffffffff. (It's important to note that foo() in this case is just accessing a member; so it's literally the first instruction in foo() which is already attempting to access via a0, and crashing. Also, ra is pointing to the instruction following the above snippet, and s1 currently contains 0x10000000, so we're quite confident that foo() was, indeed, called from the above snippet.)
Our only theory at the moment is that the two consecutive lws into a0 are a hazard -- either a documented one, in which case this looks like a compiler bug; or an undocumented one.
So: is the above assembly legal? If it is, any other ideas about what could be going on here?
Thanks!
UPDATE: Well, turns out this was all a wild goose chase: a repeat analysis of the coredump by a colleague turned up a path in the code which I had missed, where there was a jump directly to the jal foo() instruction, immediately after having set a0 to 44(sp). In other words, there is a path in the code which is consistent with the result we're seeing that does not involve hazards, or "skipped instructions" or anything... I thought I checked this, but I guess I either didn't, or missed it... :(
Anyway, I've accepted markgz's answer, since it answers my original question about the legality of these instructions (apparently they are).
A quick search of the MIPS documentation for the MIPS32R2 ISA doesn't show any restrictions on LW after LW instructions.
There might be a bug in the MIPS implementation in your CPU. Things to look at include:
What addresses are 44(sp) and 40(sp)? Are they on a page boundary, a 256 MByte boundary, or some other interesting address?
Does either of the loads trigger a page fault?
Does patching the binary to insert a NOP, SSNOP, or a SYNC instruction between the loads make the problem go away?

PIC Assembly: Calling functions with variables

So say I have a variable, which holds a song number. -> song_no
Depending upon the value of this variable, I wish to call a function.
Say I have many different functions:
Fcn1
....
Fcn2
....
Fcn3
So for example,
If song_no = 1, call Fcn1
If song_no = 2, call Fcn2
and so forth...
How would I do this?
You should have a compare instruction in the instruction set (the post suggests you are looking for an assembly solution). The result of a compare is usually to set a flag bit or a value in a register, but you need to check the instruction set for the details.
The code should look something like:
load(song_no, $R1)
cmpeq($1,R1) //result is in R3
jmpe Fcn1 //jump if equal
cmpeq ($2,R1)
jmpe Fcn2
....
Hope this helps
I'm not well acquainted with the PIC, but this sort of thing is usually implemented as a jump table. In short, put pointers to the target routines in an array and call/jump to the entry indexed by your song_no. You just need to calculate the address into the array somehow, so it is very efficient. No compares are necessary.
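The jump-table idea itself, independent of the PIC, can be sketched in C with an array of function pointers. The routine bodies, the last_played variable and the play_song wrapper are assumptions made for this illustration; only the names fcn1 to fcn3 and song_no come from the question.

```c
#include <assert.h>
#include <stddef.h>

static int last_played = 0;  /* hypothetical: records which routine ran */

static void fcn1(void) { last_played = 1; }
static void fcn2(void) { last_played = 2; }
static void fcn3(void) { last_played = 3; }

typedef void (*song_fn)(void);

/* The jump table: entry i holds the routine for song number i + 1. */
static const song_fn song_table[] = { fcn1, fcn2, fcn3 };

int last_song_played(void) { return last_played; }

/* Returns 1 if a routine was called, 0 if song_no is out of range. */
int play_song(unsigned song_no)
{
    size_t count = sizeof song_table / sizeof song_table[0];
    if (song_no < 1 || song_no > count) {
        return 0;                        /* never index outside the table */
    }
    assert(song_table[song_no - 1] != NULL);
    song_table[song_no - 1]();           /* indexed call, no compare chain */
    return 1;
}
```

Calling play_song(2) runs fcn2, and an out-of-range number such as 0 or 4 is rejected before the table is indexed. The range check matters on real hardware too, since jumping past the end of a table lands in unrelated code.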
To elaborate on Jens' reply: the traditional way of doing this on 12/14-bit PICs is the same way you would look up constant data from ROM, except that instead of returning a number with RETLW you jump forward to the desired routine with GOTO. The actual jump into the jump table is performed by adding the offset to the program counter.
Something along these lines:
movlw high(table)
movwf PCLATH
movf song_no,w
addlw table
btfsc STATUS,C
incf PCLATH
addwf PCL
table:
goto fcn1
goto fcn2
goto fcn3
.
.
.
Unfortunately there are some subtleties here.
The PIC16 only has an eight-bit accumulator while the address space to jump into is 11 bits wide. Therefore both a directly writable low byte (PCL) and a latched high-byte register (PCLATH) are available. The value in the latch is applied as the MSB once the jump is taken.
The jump table may cross a page, hence the manual carry into PCLATH. Omit the BTFSC/INCF if you know the table will always stay within a 256-instruction page.
The ADDWF instruction will already have been read and be pointing at table when PCL is to be added to. Therefore a 0 offset jumps to the first table entry.
Unlike the PIC18, each GOTO instruction fits in a single 14-bit instruction word and PCL addresses instructions, not bytes, so the offset should not be multiplied by two.
All things considered, you're probably better off searching for general PIC16 tutorials. Any of these will clearly explain how data/jump tables work, not to mention begin with the basics of how to handle the chip. Frankly, it is a particularly convoluted architecture and I would advise staying with the "free" HI-TECH C compiler unless you particularly enjoy logic puzzles or desperately need the performance.

Designing A Simplified MIPS Processor

So I'm going over some old quizzes for my Computer Organization final and I must have missed this lecture or something. I'm decently proficient in programming MIPS, but this problem has me completely stumped. Could someone help me understand this?
The diagram is missing lines connecting the various parts of the processor as well as a multiplexor for determining if the next instruction address is coming from the PC+4 or from a register as in a jr ra instruction.
There needs to be a line from the ALU to the write data portion of the register file. This is for R-type instructions, as their result will need to be written back to the destination register. Lines from Read data 1 and Read data 2 need to go into the ALU; this is how the values from the registers make their way into the ALU for R-type instructions.
A couple of lines have been added from the instruction memory to the registers, but you're missing the one to the Write Register (this specifies the destination register for R-type instructions).
For the PC, the line going into the adder goes into the other input (the upper one). The 4 for the adder is constant, as each instruction address is 4 bytes after the previous one, so unless we're jumping to an address we will be executing the instruction immediately after the current instruction. The line from the PC to the read address is also necessary, as the PC specifies the address at which to find the current instruction. The line going into the PC comes from either the result of the PC+4 addition or from the register specified in a jr ra instruction.
To handle this decision a multiplexor is needed. This multiplexor has two data inputs and one output: one input comes from the PC+4 adder (for regular R-type instructions) and the other from Read Data 1 (for jr ra instructions). The input from Read Data 1 should be visualized as a split from the line between Read Data 1 and the ALU. The output will go right back to the PC, as it determines the next instruction to execute.
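The next-PC selection described above can be written as a small C model. It is a behavioural sketch only, under the assumption that a single control signal (called is_jr here, a made-up name) drives the multiplexor; mux2 and next_pc are likewise hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* 2-to-1 multiplexor: select == 0 passes in0, select == 1 passes in1. */
uint32_t mux2(uint32_t in0, uint32_t in1, int select)
{
    assert(select == 0 || select == 1);
    return select ? in1 : in0;
}

/* Next-PC logic: the adder always computes PC + 4, and the multiplexor
 * chooses between that and Read Data 1 (the register named by jr). */
uint32_t next_pc(uint32_t pc, uint32_t read_data_1, int is_jr)
{
    uint32_t pc_plus_4 = pc + 4u;   /* constant 4: one instruction is 4 bytes */
    return mux2(pc_plus_4, read_data_1, is_jr);
}
```

For an R-type instruction at 0x00400000, next_pc returns 0x00400004; for a jr whose register holds 0x00400020, it returns 0x00400020.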
I think that's everything that's wanted for that question, as the prof specifies that control signals are already generated (the multiplexor is a type of control unit, but I think it's necessary nonetheless). Hope that helps!