Error assigning values to a 2D array of registers in Verilog: synthesis warnings

Hi, when I write this piece of code:
module memo(out1);
reg [3:0] mem [2:0] ;
output wire [3:0] out1;
initial
begin
mem[0][3:0]=4'b0000;
mem[1][3:0]=4'b1000;
mem[2][3:0]=4'b1010;
end
assign out1= mem[1];
endmodule
I get the following warnings, which make the code unsynthesizable:
WARNING:Xst:1780 - Signal mem<2> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
WARNING:Xst:653 - Signal mem<1> is used but never assigned. This sourceless signal will be automatically connected to value 1000.
WARNING:Xst:1780 - Signal > is never used or assigned. This unconnected signal will be trimmed during the optimization process.
Why am I getting these warnings?
Haven't I assigned the values of mem[0], mem[1] and mem[2]? Thanks for your help!

Your module has no inputs and a single output, out1. I'm not totally sure what the point of the module is with respect to your larger system, but you're initializing mem and then only using mem[1]. You could equivalently have a module that just assigns 4'b1000 to out1, since mem never changes. So yes, you did initialize the array, but because you don't use any of the other values, the Xilinx tools optimize your module during synthesis and "trim the fat." If you were to simulate this module (say, in ModelSim), you'd see your initializations just fine. Based on your warnings, though, I'm not sure why you've concluded that your code is unsynthesizable. It appears to me that you can definitely synthesize it; it's just a somewhat odd way to drive a single output with 4'b1000.
With regard to using initial blocks to store values in block RAM (e.g. to make a ROM): that's fine. I've done that several times without issue. A common use is to store coefficients in block RAM, which are read out later. That said, the way this module is written, there's no way to read anything other than mem[1] out of mem anyway.

Related

What does "cleanup" in NEXT_INST_F and NEXT_INST_V mean?

I am plowing through the Tcl source code and am confused by the macros NEXT_INST_F and NEXT_INST_V in tclExecute.c, specifically by their cleanup parameter.
Initially I thought cleanup meant the net number of slots popped from the stack, e.g. when 3 objects are popped and 1 object is pushed, cleanup is 2.
But I see that INST_LOAD_STK has cleanup set to 1. Shouldn't it be zero, since one object is popped and one object is pushed?
I get lost reading the code of NEXT_INST_F and NEXT_INST_V; there are too many jumps.
I hope you can clarify the semantics of cleanup for me.
The NEXT_INST_F and NEXT_INST_V macros (in the implementation of Tcl's bytecode engine) clean up the state of the operand stack and push the result of the operation before going to the next instruction. The only practical difference between the two is that one is designed to be highly efficient when the number of stack locations to be cleaned up is a constant number (from a small range: 0, 1 and 2 — this is the overwhelming majority of cases), and the other is less efficient but can handle a variable number of locations to clean up or a number outside the small range. So NEXT_INST_F is basically an optimised version of NEXT_INST_V.
The place where macros are declared in tclExecute.c has this to say about them:
/*
* The new macro for ending an instruction; note that a reasonable C-optimiser
* will resolve all branches at compile time. (result) is always a constant;
* the macro NEXT_INST_F handles constant (nCleanup), NEXT_INST_V is resolved
* at runtime for variable (nCleanup).
*
* ARGUMENTS:
* pcAdjustment: how much to increment pc
* nCleanup: how many objects to remove from the stack
* resultHandling: 0 indicates no object should be pushed on the stack;
* otherwise, push objResultPtr. If (result < 0), objResultPtr already
* has the correct reference count.
*
* We use the new compile-time assertions to check that nCleanup is constant
* and within range.
*/
However, instructions can also directly manipulate the stack. This complicates things quite a lot. Most don't, but that's not the same as all. If you were to view this particular load of code as one enormous pile of special cases, you'd not be very wrong.
INST_LOAD_STK (a.k.a loadStk if you're reading disassembly of some Tcl code) is an operation that will pop an unparsed variable name from the stack and push the value read from the variable with that name. (Or an error will be thrown.) It is totally expected to pop one value and push another (from objResultPtr) since we are popping (and decrementing the reference count) of the variable name value, and pushing and incrementing the reference count of a different value that was read from the variable.
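To make the accounting concrete, here is a toy Python model of the idea described above. It is only a sketch, not Tcl's actual implementation: the names `next_inst` and `load_stk` and the use of a plain list as the operand stack are assumptions made for illustration. The point it shows is that the cleanup count covers only the operands removed; pushing the result is a separate step, which is why a load that pops one name and pushes one value has a cleanup of 1 rather than 0.

```python
def next_inst(stack, n_cleanup, result=None, push_result=False):
    """Toy stand-in for NEXT_INST_F/V (hypothetical helper, not Tcl code).

    Removes n_cleanup operands from the top of the stack, then optionally
    pushes the result. The two steps are counted independently.
    """
    if n_cleanup:
        del stack[-n_cleanup:]
    if push_result:
        stack.append(result)
    return stack


def load_stk(stack, variables):
    """Toy loadStk: the variable name on top of the stack is replaced by its value."""
    name = stack[-1]          # the operand is read in place first
    value = variables[name]   # a missing name raises KeyError (the "error thrown" case)
    return next_inst(stack, 1, value, push_result=True)  # cleanup = 1, then push


stack = ["x"]
load_stk(stack, {"x": 42})
print(stack)
```

Under this model an instruction that consumes two operands and produces one (an addition, say) would use a cleanup of 2, not the net change of 1.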
The code to read and write variables is among the most twisty in the bytecode engine. Far more goto than is good for your health.

Octave force deepcopy

The question
What are the ways of coercing Octave into creating a real copy of an arbitrary object? Structures are the main interest.
My underlying problem
In my problem I'm obtaining a rather large structure from another function in a loop but for the current task only a few pieces of it are needed. For example:
for i=1:many
res=solver(params);
store1{i}=res.string1;
store2{i}=res.arr(:,1);
end
res is a sizable chunk of data and, due to lazy copying, the store variables are references to tiny portions of that chunk. After I store those tiny portions I don't need res itself. However, since the middle of that chunk is still referenced by the stores, the memory area cannot be reused for the res obtained on the next iteration (they are the same size), so another sizable piece of memory is allocated, which is then again pinned by a few tiny references, and so on.
Without storing parts of res, the program successfully keeps the memory consumption same after first couple of iterations.
So how do I make a complete copy of structure field?
I've tried using struct-related functions like rmfield, but those keep references instead of creating their own objects.
I've tried to wrap the assignment in its own function:
new_struct=copy( rmfield(old_struct,"bigdata"));
function c=copy(a);
c=a;
end;
This by the way doesn't work even for arrays.
I'm interested in a method applicable to any generic variable.
Minimal working example of the problem
a=cell(3,1);
for i=1:length(a);
r=rand(100000,1000);
a{i}=r(1:100,end);
whos; fflush(stdout);
pause(2);
end;
The above code causes memory usage to grow gradually by far more than the 8.08 kb reported by whos, because the references stored in a{i} pin a bigger memory block than they actually need. If you force a proper copy, the problem is not present.
Numerical arrays
For numeric types addition of zero is enough to warrant a new array.
c=a+0;
Strings
For a string, which is a 1 x n char array, something along the following lines will work:
c=[a "a"](1:end-1);
Multidimensional char arrays will require concatenation with a column:
c=[a true(size(a,1),1)](:,1:end-1);
Here true is used to generate a dummy array of a size compatible with char. (There seems to be no procedural way of generating a char array of arbitrary size; char(zeros(size(a,1),1)) and char(true(size(a,1),1)) caused excess memory usage during their creation on some calls.)
Note that an empty concatenation, c=[a ""];, will not result in a copy. It is also possible to do c=[a+0 ""];, which does copy because of the +0, but it incurs type conversions to and from double, which is 8 times larger in size. (char(zeros( doesn't seem to cause that.)
Other types
In general, you can use typecast, for the types it supports, to avoid tailoring the expressions manually as I had to do above:
typelist={"double","single","char"}; %full list of supported types is available in the link
class_of_a = typelist{ isa(a,typelist) };
c=typecast( [typecast(a,'single'); single(1)] (1:end-1), class_of_a);
Single is seemingly smallest datatype available in octave.
Note that logical is not supported by this method.
Copying structures
Apparently you'd have to write your own function that walks over the struct fields, copies them with the methods above, and recurses into substructs.
(As it doesn't involve any complexities relevant here, I'd rather leave that to those who actually need it, my own problem being solved by the +0's.)

Mips datapath procedure for executing an AND instruction?

Based on this figure, executing the AND instruction would cause these values to be assigned to the signals labeled in blue:
RegWrite = 1
ALUSrc = 0
ALU operation = 0000
MemRead = 0
MemWrite = 0
MemtoReg = 0
PCSrc = 0
However, I am a little confused about which inputs will be used in the Registers block. Can anyone describe the overall AND procedure in the MIPS datapath?
Starting from the point after the instruction is read from instruction memory, you need to know that AND is an R-type instruction and thus uses 3 registers. Which registers are actually used is determined by the encoded instruction. (An R-type has three 5-bit fields, one for each register.) rs and rt go to Read register 1 and 2, while rd goes to Write register. From there, Read data 1 and 2 (the contents of registers s and t) go to the ALU, where a bitwise AND is performed on them. The result is written to the write register. I traced the path in your picture (omitting the PC incrementing part).
I'm taking a class that uses that exact book this semester. If you look a little ahead, it goes deeper into what is going on. The explanation of the control (blue) lines helps a lot.
The mux blocks are multiplexers, that is, they select one of two inputs as the output. In this case, the ALUSrc mux will use Read data 2 because AND is an R-type. If it were an I-type, it would switch to the data coming from the sign extend, because that would be the immediate. The other mux allows either data from memory or the result of an ALU operation to be written to the write register. In this case, it will be the result of an ALU operation.
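The walk through the datapath above can be sketched in Python. This is a rough model, not the book's design: the function names are hypothetical, the register file is a plain list, and the mux polarities (ALUSrc = 0 selects Read data 2, MemtoReg = 0 selects the ALU result) are assumed from the control values listed in the question. The field positions are the standard MIPS32 R-type layout.

```python
def decode_r_type(instr):
    """Extract the three 5-bit register fields of a 32-bit R-type instruction."""
    rs = (instr >> 21) & 0x1F   # feeds Read register 1
    rt = (instr >> 16) & 0x1F   # feeds Read register 2
    rd = (instr >> 11) & 0x1F   # feeds Write register
    return rs, rt, rd


def sign_extend_16(value):
    """Sign-extend a 16-bit immediate (unused by AND, but always computed)."""
    return value - 0x10000 if value & 0x8000 else value


def run_and(instr, regs):
    """Model one AND instruction moving through the datapath (hypothetical helper)."""
    rs, rt, rd = decode_r_type(instr)
    read_data_1, read_data_2 = regs[rs], regs[rt]
    imm = sign_extend_16(instr & 0xFFFF)

    alu_src, mem_to_reg, reg_write = 0, 0, 1         # control values from the question
    alu_b = imm if alu_src else read_data_2          # ALUSrc mux
    alu_result = read_data_1 & alu_b                 # ALU performs the bitwise AND
    mem_data = None                                  # MemRead = 0: nothing is read
    write_data = mem_data if mem_to_reg else alu_result  # MemtoReg mux
    if reg_write:
        regs[rd] = write_data                        # write back to rd
    return regs


regs = [0] * 32
regs[9], regs[10] = 0b1100, 0b1010                   # $t1 and $t2
run_and(0x012A4024, regs)                            # and $t0, $t1, $t2
print(bin(regs[8]))
```

The Registers block therefore receives three register numbers (rs, rt, rd) plus the write data, and RegWrite decides whether the write happens.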
To answer your question about the register block simply: keep in mind that the inputs to the register block are the addresses of the registers your instruction will be using. The register block then either fetches the data in the registers whose addresses were given, or writes data to the write register at the end.
One note, however: there is an inconsistency in your mux design. MemtoReg and ALUSrc should have opposite values, so unless one of the two muxes is drawn upside down (which is not advisable), there is a mistake in your controller logic.

PIC Assembly: Calling functions with variables

So say I have a variable, which holds a song number. -> song_no
Depending upon the value of this variable, I wish to call a function.
Say I have many different functions:
Fcn1
....
Fcn2
....
Fcn3
So for example,
If song_no = 1, call Fcn1
If song_no = 2, call Fcn2
and so forth...
How would I do this?
You should have a compare instruction in the instruction set (the post suggests you are looking for an assembly solution). Its result is usually a flag bit or a value set in a register, but you need to check the instruction set for the details.
The code should look something like:
load(song_no, R1)   // R1 = song_no
cmpeq($1, R1)       // compare with the constant 1; result is in R3
jmpe Fcn1           // jump if equal
cmpeq($2, R1)       // compare with the constant 2
jmpe Fcn2
....
Hope this helps
I'm not well acquainted with the pic, but these sort of things are usually implemented as a jump table. In short, put pointers to the target routines in an array and call/jump to the entry indexed by your song_no. You just need to calculate the address into the array somehow, so it is very efficient. No compares necessary.
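The jump-table idea is independent of the instruction set, so here is a minimal sketch of it in Python rather than PIC assembly. The function names and return values are placeholders, and the subtraction of 1 assumes that song numbers start at 1 as in the question.

```python
def fcn1():
    return "playing song 1"


def fcn2():
    return "playing song 2"


def fcn3():
    return "playing song 3"


# The "table": one entry per routine, indexed by position.
JUMP_TABLE = [fcn1, fcn2, fcn3]


def play(song_no):
    """Dispatch by index instead of a chain of compares."""
    index = song_no - 1                      # songs are numbered from 1
    if not 0 <= index < len(JUMP_TABLE):
        raise ValueError("no routine for song %d" % song_no)
    return JUMP_TABLE[index]()               # one lookup, one call


print(play(2))
```

The range check matters just as much in assembly: an out-of-range index would jump past the end of the table into unrelated code.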
To elaborate on Jens' reply: the traditional way of doing this on 12/14-bit PICs is the same way you would look up constant data from ROM, except that instead of returning a number with RETLW you jump forward to the desired routine with GOTO. The actual jump into the jump table is performed by adding the offset to the program counter.
Something along these lines:
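movlw high(table)
movwf PCLATH ; preload the high byte of the table address
movf song_no,w
addlw low(table) ; W = low byte of the table address + offset
btfsc STATUS,C
incf PCLATH,f ; carry: the target lies in the next 256-word page
movwf PCL ; write the low byte; PCLATH is applied at the same time
table:
goto fcn1
goto fcn2
goto fcn3
.
.
.
Unfortunately there are some subtleties here.
The PIC16 only has an eight-bit accumulator while the address space to jump into is 11 bits wide. Therefore there is both a directly writable low byte (PCL) and a latched high byte (PCLATH). The value in the latch is applied as the MSB once the jump is taken.
The jump table may cross a page, hence the manual carry into PCLATH. Omit the BTFSC/INCF if you know the table will always stay within a 256-instruction page.
Because the full low byte of the target is computed in W, it is written with MOVWF PCL; adding it with ADDWF PCL would count the table address twice. If you know the table does not cross a page you can use the shorter form instead: load only the offset into W and execute ADDWF PCL directly in front of the table. The ADDWF instruction will already have been read, so PCL is pointing at table when it is added to, and a 0 offset jumps to the first table entry.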
Unlike on the PIC18, each GOTO instruction fits in a single 14-bit instruction word and PCL addresses instructions, not bytes, so the offset should not be multiplied by two.
All things considered, you're probably better off searching for general PIC16 tutorials. Any of these will clearly explain how data/jump tables work, not to mention begin with the basics of how to handle the chip. Frankly, it is a particularly convoluted architecture and I would advise staying with the "free" HI-TECH C compiler unless you particularly enjoy logic puzzles or desperately need the performance.

Cuda Kernel with different array sizes

I am working on a fluid dynamics problem in CUDA and ran into the following problem.
Say I have an array, e.g. debug_array, with length 600 and an array
value_array with length 100, and I want to do something like
for(int i=0;i<6;i++)
{
debug_array[6*(bx*block_size+tx)+i] = value_array[bx*block_size+tx];
}
block_size would in this example be based on the 100-element array, e.g.
4 blocks with a block_size of 25.
If value_array contains e.g. 10;20;30;.....
I would expect debug_array to have groups of 6 identical values like
10;10;10;10;10;10;20;20;20;20;20;20;30......
The problem is that it is not picking up all the values from the value array. Any idea
why this isn't working, or a good workaround?
What does work is defining float val = value_array[bx*block_size+tx]; outside the for loop and keeping this inside the loop: debug_array[bx*block_size+tx+i] = val;
But I would like to avoid that, as my kernels have between 5 and 10 device functions inside the loop and it just makes them hard to read.
Thanks in advance, any advice will be appreciated.
Markus
There seems to be an error in computing the index when it is written as bx*block_size+tx+i:
Let's assume bx = 0 and tx = 0.
The first 6 elements in debug_array (0 to 5) will be filled with data.
Next thread, tx = 1: elements 1 to 6 will be filled with data (overwriting existing data).
Because the threads work in parallel, it is not determined which thread will be scheduled first and therefore which values will end up in debug_array.
You should have written:
debug_array[6*(bx*block_size+tx)+i] = value_array[bx*block_size+tx];
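The difference between the two index formulas can be checked on the host with a small Python simulation. This is only a sketch: the threads are run one after another in a fixed order, whereas on a GPU the order is undefined, and the helper name `fill` is made up for the illustration. The sizes (4 blocks of 25 threads, 6 copies per value) are taken from the question.

```python
def fill(index_fn, values, n_blocks=4, block_size=25, repeat=6):
    """Run every (bx, tx) pair sequentially and record where each one writes."""
    out = [None] * (len(values) * repeat)
    writes = [0] * len(out)                  # how often each slot is written
    for bx in range(n_blocks):
        for tx in range(block_size):
            gid = bx * block_size + tx       # global thread index
            for i in range(repeat):
                idx = index_fn(gid, i)
                out[idx] = values[gid]
                writes[idx] += 1
    return out, writes


values = [10 * (k + 1) for k in range(100)]  # 10, 20, 30, ...

# Index without the factor 6: neighbouring threads overlap.
bad_out, bad_writes = fill(lambda gid, i: gid + i, values)

# Index with the factor 6: each thread owns its own group of 6 slots.
good_out, good_writes = fill(lambda gid, i: 6 * gid + i, values)

print(max(bad_writes), bad_out.count(None))
print(max(good_writes), good_out.count(None))
```

With the scaled index every slot is written exactly once, so the result no longer depends on the order in which threads run.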
If changing the code to move the value_array expression out of the loop and into a temp variable makes the code work - and that is the only code change you made - then this smells like a compiler bug.
Try changing your nvcc compiler options to reduce or disable optimizations and see if the value_array expression inside the loop changes behavior. Also, make sure you're using the latest CUDA tools.
Optimizing compilers will often attempt to move expressions that aren't dependent on the loop index variable out of the loop, exactly like your manual workaround. It's called "invariant code motion" and it makes loops faster by reducing the amount of code that executes in each iteration of the loop. If manually extracting the invariant code from the loop works, but letting the compiler figure it out on its own doesn't, that casts doubt on the compiler.
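As a small illustration of what that transformation does, the two Python functions below are the "before" and "after" of hoisting the invariant lookup. The function names are invented for the example, and Python itself does not perform this optimization; the sketch only shows that the two forms are meant to be equivalent.

```python
def expand_inline(values, repeat=6):
    """Lookup written inside the inner loop, as in the original kernel."""
    out = [0] * (len(values) * repeat)
    for gid in range(len(values)):
        for i in range(repeat):
            out[repeat * gid + i] = values[gid]   # does not depend on i
    return out


def expand_hoisted(values, repeat=6):
    """Same loop after moving the invariant lookup out by hand."""
    out = [0] * (len(values) * repeat)
    for gid in range(len(values)):
        val = values[gid]                         # evaluated once per gid
        for i in range(repeat):
            out[repeat * gid + i] = val
    return out


print(expand_inline([10, 20]) == expand_hoisted([10, 20]))
```

If the two forms of a kernel give different results, either the optimizer is at fault or something else differs between the two versions, such as the index expression.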