VHDL warning: PAR will not attempt to route this signal - warnings

I am learning VHDL and I am on the quest to implement my own FIFO buffer, but I have some problems. Since I want to deploy the code on a Xilinx Spartan 6 device I am using the Xilinx WebPack ISE with the associated VHDL compiler, but I am getting very weird warnings:
WARNING:Par:288 - The signal Mram_buf_mem1_RAMD_D1_O has no load. PAR will not attempt to route this signal.
WARNING:Par:283 - There are 1 loadless signals in this design. This design will cause Bitgen to issue DRC warnings.
Here is my code:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity FIFO_buffer is
    generic ( BUFFER_SIZE : positive := 4; -- # of words
              WORD_WIDTH : positive := 8); -- # of bits per word
    port ( data_in : in STD_LOGIC_VECTOR (WORD_WIDTH - 1 downto 0);
           full : out STD_LOGIC := '0';
           write : in STD_LOGIC;
           data_out : out STD_LOGIC_VECTOR (WORD_WIDTH - 1 downto 0);
           empty : out STD_LOGIC := '1';
           read : in STD_LOGIC);
end FIFO_buffer;
architecture arch of FIFO_buffer is
    type ram_t is array (0 to BUFFER_SIZE - 1) of std_logic_vector(WORD_WIDTH - 1 downto 0);
    signal buf_mem : ram_t := (others => (others=>'0'));
    signal read_idx : integer range 0 to BUFFER_SIZE - 1 := 0;
    signal write_idx : integer range 0 to BUFFER_SIZE - 1 := 0;
    signal buf_full : std_logic := '0';
    signal buf_empty : std_logic := '0';
begin
    writing_data: process(write)
    begin
        if(rising_edge(write)) then
            if(buf_full = '0') then
                buf_mem(write_idx) <= data_in;
                write_idx <= write_idx + 1;
                if(write_idx = read_idx)
                    then buf_full <= '1';
                    else buf_full <= '0';
                end if;
            end if;
        end if;
    end process;
    reading_data: process(read)
    begin
        if(rising_edge(read)) then
            if(buf_empty = '0') then
                data_out <= buf_mem(read_idx);
                read_idx <= read_idx + 1;
                if(read_idx = write_idx)
                    then buf_empty <= '1';
                    else buf_empty <= '0';
                end if;
            end if;
        end if;
    end process;
    full <= buf_full;
    empty <= buf_empty;
end arch;
The error seems to be caused by the data_out <= buf_mem(read_idx); line in the reading_data process. Could anyone explain to me the reason for the warning? (I know that my code has some functional problems, but that should not affect the reason for the warning)
P.S. Since I have the code here let me ask one more question. How unwise is it to have a component (such as that FIFO buffer) which is not synchronised with the global clock?

I'll address your second question first, i.e. "How unwise is it to have a component (such as that FIFO buffer) which is not synchronised with the global clock?"
It depends on your requirements. Usually, you should clock your components, so you have synchronous logic and no weird glitches caused by asynchronous paths.
However, consider what you did here. You have clocked your component: rising_edge(read) and rising_edge(write). You will find in your synthesis report the following:
Primitive and Black Box Usage:
------------------------------
<snip>
# Clock Buffers : 2
# BUFGP : 2
<snip>
Clock Information:
------------------
-----------------------------------+------------------------+-------+
Clock Signal | Clock buffer(FF name) | Load |
-----------------------------------+------------------------+-------+
read | BUFGP | 11 |
write | BUFGP | 6 |
-----------------------------------+------------------------+-------+
This is because your processes are edge-triggered on read and write rather than combinational, so XST treats both signals as clocks and routes them through clock buffers. This will lead to all kinds of problems. You mention a Xilinx Spartan-6. Unless read and write happen to be placed at an optimal IOB/BUFG site pair, you will get the following message (usually an ERROR) later in the flow:
Place:1109 - A clock IOB / BUFGMUX clock component pair have been found
that are not placed at an optimal clock IOB / BUFGMUX site pair. The clock
IOB component <write> is placed at site <A5>. The corresponding BUFG
component <write_BUFGP/BUFG> is placed at site <BUFGMUX_X2Y9>. There is only
a select set of IOBs that can use the fast path to the Clocker buffer, and
they are not being used. You may want to analyze why this problem exists and
correct it.
What this message explains in great verbosity is the following. FPGAs have dedicated routing networks for clocks, which assure low skew. (Check Xilinx UG382 for more). However, there are specific pins on the FPGA that can directly access this clock network. There, IOB (I/O Buffer) and BUFG[MUX] ([Multiplexed] Global [Clock] Buffer) are close-by, ensuring that the signal from the pin can be distributed really fast across the whole FPGA using dedicated clocking resources. You can check placement with the FPGA Editor. For instance, my write pin has to cross half the FPGA before being able to get routed using a global clock buffer. That's 3.878ns delay in my case.
The same applies for read, of course. So you see this is a bad idea. You should use dedicated clocking resources for your clocks and synchronize inputs and outputs to that.
Now, on to your main question. You have to be aware what your HDL actually describes.
You have two distinct processes, each with their own clock (read; write) that access the same memory. You have two distinct addresses as well (write_idx; read_idx).
Hence, the XST Synthesizer (that ISE uses) inferred a dual-port RAM. Because the depth as well as element width are both small, it inferred a distributed dual-port RAM. Check your synthesis report, it will say
Found 4x8-bit dual-port RAM <Mram_buf_mem> for signal <buf_mem>.
<snip>
INFO:Xst:3231 - The small RAM <Mram_buf_mem> will be implemented on LUTs in order to maximize performance and save block RAM resources. If you want to force its implementation on block, use option/constraint ram_style.
-----------------------------------------------------------------------
| ram_type | Distributed | |
-----------------------------------------------------------------------
| Port A |
| aspect ratio | 4-word x 8-bit | |
| clkA | connected to signal <write> | rise |
| weA | connected to signal <full> | low |
| addrA | connected to signal <write_idx> | |
| diA | connected to signal <data_in> | |
-----------------------------------------------------------------------
| Port B |
| aspect ratio | 4-word x 8-bit | |
| addrB | connected to signal <read_idx> | |
| doB | connected to internal node | |
-----------------------------------------------------------------------
When you now look at the technology schematic, you will see XST inferred three instances: Mram_buf_mem1, Mram_buf_mem21, Mram_buf_mem22. In my case anyway, yours might differ.
Mram_buf_mem1 is the input buffer for data_in(5:0); data_in(6) and data_in(7) actually use Mram_buf_mem21 and Mram_buf_mem22, respectively. This is just an artifact of the design not being properly constrained (what's the clock period of read and write? etc.)
So, basically your message above
WARNING:Par:288 - The signal Mram_buf_mem1_RAMD_D1_O has no load. PAR will not attempt to route this signal.
means that some output signal the inferred dual-port distributed RAM provides (D1_O) is not being used (it drives no logic/flip flops). Therefore, the Place and Route (PAR) step will not even attempt to route it. With all this information we gathered, we can safely assume that this doesn't matter and won't affect your FIFO at all.
However, what will matter is the following: You did nothing to constrain paths between your two clock domains (read domain and write domain). This means, you might run into issues where write_idx is changing while read is performed and vice-versa. This might leave you stuck in one state with full not being deasserted or empty not being asserted, because you lack synchronization logic for signals that need to cross the clock domain.
XST will not insert this logic for you. You can check for these types of errors using the Asynchronous Delay Report and Clock Region Report.
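To illustrate why a multi-bit index such as write_idx is hard to sample from another clock domain, here is a minimal C sketch. It uses Gray coding, which is one common technique for passing FIFO pointers between domains; the answer above does not prescribe it, so treat it as an assumption. All function names are hypothetical. A binary counter can flip several bits on a single increment (3 to 4 flips three), so a sample taken mid-transition may be a value the counter never held, whereas a Gray-coded counter changes exactly one bit per step.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a binary counter value to its Gray-code equivalent. */
uint32_t bin_to_gray(uint32_t b) { return b ^ (b >> 1); }

/* Convert Gray code back to binary by folding in the shifted value. */
uint32_t gray_to_bin(uint32_t g) {
    uint32_t b = g;
    for (uint32_t shift = 1; shift < 32; shift <<= 1)
        b ^= b >> shift;
    return b;
}

/* Count the bits that differ between two values. */
int bits_changed(uint32_t x, uint32_t y) {
    uint32_t d = x ^ y;
    int n = 0;
    while (d) { n += (int)(d & 1u); d >>= 1; }
    return n;
}

/* Check the one-bit-per-step property for a pointer of 'width' bits
 * (1..31), including the wrap-around from the last value back to 0. */
void gray_self_check(unsigned width) {
    uint32_t mask = (1u << width) - 1u;
    for (uint32_t i = 0; i <= mask; ++i) {
        uint32_t next = (i + 1u) & mask;
        assert(bits_changed(bin_to_gray(i), bin_to_gray(next)) == 1);
        assert(gray_to_bin(bin_to_gray(i)) == i);
    }
}
```

Calling gray_self_check(2) covers the 4-entry buffer from the question. In hardware the Gray-coded pointer would still need to pass through synchronizer flip-flops in the receiving domain; this sketch only shows the encoding.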
Now, if you're just getting started with the world of FPGAs, you might want to play around a bit with inference of primitives vs. instantiation of primitives. Check the Spartan 6 HDL library guide to see which VHDL language constructs will cause XST to infer, e.g., a RAM, FIFO, or flip-flop, and which constructs will cause it to infer weird and cryptic logic because of some unrealistic inferred timing/area constraints.
Finally, try to have synchronous logic as much as possible and properly constrain your design. Also, sorry for the long write-up if you were just looking for an easy two-liner...


Extend the CUDA.@atomic enum to a custom struct

I was wondering whether it is possible to extend the CUDA.@atomic operation to a custom type.
Here is an example of what I am trying to do:
using CUDA
struct Dual
    x
    y
end
cu0 = CuArray([Dual(1, 2), Dual(2,3)])
cu1 = CuArray([Dual(1, 2), Dual(2,3)])
indexes = CuArray([1, 1])
function my_kernel(dst, src, idx)
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @inbounds if index <= length(idx)
        CUDA.@atomic dst[idx[index]] = dst[idx[index]] + src[index]
    end
    return nothing
end
@cuda threads = 100 my_kernel(cu0, cu1, indexes)
The problem with this code is that the CUDA.@atomic call only supports basic types like Int, Float or Real.
I need it to work with my own struct.
It would be nice if someone had an idea how this could be made possible.
The underlying PTX instruction set for CUDA provides a subset of atomic store, exchange, add/subtract, increment/decrement, min/max, and compare-and-set operations for global and shared memory locations (not all architectures support all operations with all POD types, and there is evidence that not all operations are implemented in hardware on all architectures).
What all these instructions have in common is that they execute only one operation atomically. I am completely unfamiliar with Julia, but if
CUDA.@atomic dst[idx[index]] = dst[idx[index]] + src[index]
means "atomically add src[].x and src[].y to dst[].x and dst[].y" then that isn't possible because that implies two additions on separate memory locations in one atomic operation. If the members of your structure could be packed into a compatible type (a 32 bit or 64 bit unsigned integer, for example), you could perform atomic store, exchange or compare-and-set in CUDA. But not arithmetic.
If you consult this section of the programming guide, you can see an example of a brute force double precision add implementation using compare-and-set in a tight loop. If your structure can be packed into something which can be manipulated with compare-and-set, then it might be possible to roll your own atomic add for a custom type (limited to a maximum of 64 bits).
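As a rough illustration of that compare-and-set loop, here is a host-side sketch in C11. It uses <stdatomic.h> as a stand-in for the device intrinsic (on the GPU the 64-bit atomicCAS would play the role of atomic_compare_exchange_weak), and it assumes the struct is two 32-bit integers so that it fits in 64 bits. The names Dual, dual_pack, dual_unpack and dual_atomic_add are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical two-field value, assumed small enough to pack into 64 bits. */
typedef struct { int32_t x; int32_t y; } Dual;
static_assert(sizeof(Dual) == 8, "Dual must fit in one 64-bit word");

/* Pack both fields into one 64-bit word: x in the high half, y in the low. */
uint64_t dual_pack(Dual d) {
    return ((uint64_t)(uint32_t)d.x << 32) | (uint32_t)d.y;
}

/* Inverse of dual_pack. */
Dual dual_unpack(uint64_t bits) {
    Dual d = { (int32_t)(uint32_t)(bits >> 32), (int32_t)(uint32_t)bits };
    return d;
}

/* Atomically add src to *dst (both fields) with a compare-and-swap loop. */
void dual_atomic_add(_Atomic uint64_t *dst, Dual src) {
    uint64_t old = atomic_load(dst);
    uint64_t desired;
    do {
        Dual cur = dual_unpack(old);
        cur.x += src.x;
        cur.y += src.y;
        desired = dual_pack(cur);
        /* On failure, 'old' is refreshed with the current contents
         * of *dst and the addition is retried from that value. */
    } while (!atomic_compare_exchange_weak(dst, &old, desired));
}
```

Starting from (1, 2) and adding (1, 2) and then (2, 3) to the same cell leaves (4, 7), which is what the kernel in the question is after when both indexes point at element 1. The loop can retry many times under heavy contention, so this is a brute-force approach, as noted above.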
How you might approach that in Julia is definitely an exercise left to the reader.

single-cycle MIPS timing questions

I am reading the book "Computer Organization and Design". Chapter 4 describes a single-cycle MIPS machine; however, I have several doubts about it.
If the data memory and the instruction memory in the design are SRAMs, how can any instruction finish in a single clock cycle? Take a load instruction as an example: I think the single-cycle MIPS design still has to go through the following stages, with only the ID and EXE stages merged.
| 1  | 2  | 3      | 4   |
| WB |    |        |     |
|    | IF |        |     |
|    |    | ID\EXE |     |
|    |    |        | MEM |
If the data memory is updated on the negative clock edge, the ID, EXE and MEM stages can be merged, but there are still three stages left.
Can anyone explain how the "single-cycle" design works? Thanks!
The single-cycle processor that you read about in chapter 4 is a slight oversimplification of what is implementable in reality. They don't show some of the tricky details. For instance, one timing assumption you could make is that your memory reads are combinational and memory writes take one positive edge to complete, i.e. similar to the register file. In that case, when the clock edge arrives, your IF stage is populated with an instruction. Then, for the duration of that cycle, you decode and execute the instruction, and writeback happens on the next clock edge. The same is true for a data store: the memory will be written on the next clock edge. In the case of loads, you are assuming a combinational memory read, so your data will arrive before the clock edge, and on the edge it will be written to the register file.
Now, this is not the best way to implement this and you have to make several assumptions. In a slightly more realistic unpipelined processor you could have a stall signal that would roll over to the next cycle if you are waiting for the memory request. So you can imagine you would have a Stall_on_IF signal and Stall_on_LD signal that would tell you to stall this cycle until your instruction/data arrives. When they do arrive, you latch them and continue execution next cycle.
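The first timing assumption above (combinational reads, writes committed on the clock edge) can be sketched as a small C model. This is only an illustration under simplifying assumptions: instructions are pre-decoded into a hypothetical struct instead of real MIPS encodings, memory is word-addressed, and only lw, sw and add are modelled.

```c
#include <assert.h>
#include <stdint.h>

enum op { OP_LW, OP_SW, OP_ADD };

/* Hypothetical pre-decoded instruction; real MIPS encoding is omitted. */
struct instr { enum op op; int rd, rs, rt; int32_t imm; };

struct cpu {
    uint32_t pc;              /* index into imem, updated on the clock edge */
    int32_t  regs[8];         /* register file, written on the clock edge */
    int32_t  dmem[16];        /* combinational read, edge-triggered write */
    const struct instr *imem; /* instruction memory, combinational read */
};

/* One clock cycle. Everything above the "clock edge" comment models
 * combinational logic settling; everything below is the state update. */
void cpu_cycle(struct cpu *c) {
    struct instr i = c->imem[c->pc];                    /* IF  */
    int32_t a = c->regs[i.rs];                          /* ID  */
    int32_t b = c->regs[i.rt];
    int32_t alu = (i.op == OP_ADD) ? a + b : a + i.imm; /* EXE */
    int32_t load = 0;
    if (i.op != OP_ADD)
        assert(alu >= 0 && alu < 16);                   /* address in range */
    if (i.op == OP_LW)
        load = c->dmem[alu];                            /* MEM read */

    /* ---- rising clock edge: commit all state changes at once ---- */
    if (i.op == OP_SW)      c->dmem[alu] = b;
    else if (i.op == OP_LW) c->regs[i.rt] = load;
    else                    c->regs[i.rd] = alu;
    c->pc += 1;
}
```

Each call to cpu_cycle is one clock period, so the period has to be long enough for the slowest path (instruction fetch, register read, ALU, data memory read) to settle before the edge. That is why the load completes in one cycle in this model.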

How to diag imprecise bus fault after config of priority bit allocation, Cortex M3 STM32F10x w uC/OS-III

I have an issue in an app written for the ST Microelectronics STM32F103 (ARM Cortex-M3 r1p1). RTOS is uC/OS-III; dev environment is IAR EWARM v. 6.44; it also uses the ST Standard Peripheral Library v. 1.0.1.
The app is not new; it's been in development and in the field for at least a year. It makes use of two UARTs, I2C, and one or two timers. Recently I decided to review interrupt priority assignments, and I rearranged priorities as part of the review (things seemed to work just fine).
I discovered that there was no explicit allocation of group and sub-priority bits in the initialization code, including the RTOS, and so to make the app consistent with another app (same product, different processor) and with the new priority scheme, I added a call to NVIC_PriorityGroupConfig(), passing in NVIC_PriorityGroup_2. This sets the PRIGROUP value in the Application Interrupt and Reset Control Register (AIRCR) to 5, allocating 2 bits for group (preemption) priority and 2 bits for subpriority.
After doing this, I get an imprecise bus fault exception on execution, not immediately but very quickly thereafter. (More on where I suspect it occurs in a moment.) Since it's imprecise (BFSR.IMPRECISERR asserted), there's nothing of use in BFAR (BFSR.BFARVALID clear).
The STM32F family group implements 4 bits of priority. While I've not found this mentioned explicitly anywhere, it's apparently the most significant nybble of the priority. This assumption seems to be validated by the PRIGROUP table given in documentation (STM32F10xxx/20xxx/21xxx/L1xxx Cortex-M3 Programming Manual (Doc 15491, Rev 5), sec. 4.4.5, Application interrupt and control register (SCB_AIRCR), Table 45, Priority grouping, p. 134).
In the ARM scheme, priority values comprise some number of group or preemption priority bits and some number of subpriority bits. Group priority bits are upper bits; subpriority bits are lower. The 3-bit AIRCR.PRIGROUP value controls how the bits are allocated between the two. PRIGROUP = 0 configures 7 bits of group priority and 1 bit of subpriority; PRIGROUP = 7 configures 0 bits of group priority and 8 bits of subpriority (thus priorities are all subpriority, and no preemption occurs for exceptions with settable priorities).
The reset value of AIRCR.PRIGROUP is defined to be 0.
For the STM32F10x, since only the upper 4 bits are implemented, it seems to follow that PRIGROUP = 0, 1, 2, 3 should all be equivalent, since they all correspond to >= 4 bits of group priority.
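To make that arithmetic concrete, here is a minimal C sketch of the split as I understand it from the description above: the group field is architecturally bits [7:PRIGROUP+1], and only the top N bits of the priority byte exist on a given device. The struct and function names are hypothetical and not part of any ST or CMSIS library.

```c
#include <assert.h>

/* Number of priority bits that end up as group and subpriority bits. */
struct prio_split { int group_bits; int sub_bits; };

/* prigroup:    AIRCR.PRIGROUP value, 0..7
 * implemented: number of priority bits the device implements, 1..8
 *              (assumed to be the most significant bits of the byte) */
struct prio_split prio_split_for(int prigroup, int implemented) {
    assert(prigroup >= 0 && prigroup <= 7);
    assert(implemented >= 1 && implemented <= 8);
    int group = 7 - prigroup;      /* architectural group field width */
    if (group > implemented)
        group = implemented;       /* low bits that do not exist drop out */
    struct prio_split s = { group, implemented - group };
    return s;
}
```

With implemented = 4, PRIGROUP values 0 through 3 all yield 4 group bits and 0 subpriority bits, and PRIGROUP = 5 yields 2 and 2, which matches the reasoning above.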
Given that assumption, I also tried calling NVIC_PriorityGroupConfig() with a value of NVIC_PriorityGroup_4, which corresponds to a PRIGROUP value of 3 (4 bits group priority, no subpriority).
This change also results in the bus fault exception.
Unfortunately, the STM32F103 is, I believe, r1p1, and so does not implement the Auxiliary Control Register (ACTLR; introduced in r2p0), so I can't try out the DISDEFWBUF bit (disables use of the write buffer during default memory map accesses, making all bus faults precise at the expense of some performance reduction).
I'm almost certain that the bus fault occurs in an ISR, and most likely in a UART ISR. I've set a breakpoint at a particular place in code, started the app, and had the bus fault before execution hit the breakpoint; however, if I step through code in the debugger, I can get to and past that breakpoint's location, and if I allow it to execute from there, I'll see the bus fault some small amount of time after I continue.
The next step will be to attempt to pin down what ISR is generating the bus fault, so that I can instrument it and/or attempt to catch its invocation and step through it.
So my questions are:
1) Anyone have any suggestions as to how to go about identifying the origin of imprecise bus fault exceptions more intelligently?
2) Why would setting PRIGROUP = 3 change the behavior of the system when PRIGROUP = 0 is the reset default? (PRIGROUP=0 means 7 bits group, 1 bit sub priority; PRIGROUP=3 means 4 bits group, 4 bits sub priority; STM32F10x only implements upper 4 bits of priority.)
Many, many thanks to everyone in advance for any insight or non-NULL pointers!
(And of course if I figure it out beforehand, I'll update this post with any information that might be useful to others encountering the same sort of scenario.)
Even if BFAR is not valid, you can still read other related registers within your bus-fault ISR:
#include <stdio.h>
#include "stm32f10x.h" /* assumed device header (ST Standard Peripheral Library); provides SCB */

/* hardfault_args points at the stacked exception frame: R0-R3, R12, LR, PC, PSR. */
void HardFault_Handler_C(unsigned int* hardfault_args)
{
    printf("R0 = 0x%.8X\r\n",hardfault_args[0]);
    printf("R1 = 0x%.8X\r\n",hardfault_args[1]);
    printf("R2 = 0x%.8X\r\n",hardfault_args[2]);
    printf("R3 = 0x%.8X\r\n",hardfault_args[3]);
    printf("R12 = 0x%.8X\r\n",hardfault_args[4]);
    printf("LR = 0x%.8X\r\n",hardfault_args[5]);
    printf("PC = 0x%.8X\r\n",hardfault_args[6]);
    printf("PSR = 0x%.8X\r\n",hardfault_args[7]);
    /* Fault status registers in the System Control Block; volatile forces a real read. */
    printf("BFAR = 0x%.8X\r\n",*(volatile unsigned int*)0xE000ED38);
    printf("CFSR = 0x%.8X\r\n",*(volatile unsigned int*)0xE000ED28);
    printf("HFSR = 0x%.8X\r\n",*(volatile unsigned int*)0xE000ED2C);
    printf("DFSR = 0x%.8X\r\n",*(volatile unsigned int*)0xE000ED30);
    printf("AFSR = 0x%.8X\r\n",*(volatile unsigned int*)0xE000ED3C);
    printf("SHCSR = 0x%.8X\r\n",(unsigned int)SCB->SHCSR);
    while (1);
}
If you can't use printf at the point in the execution when this specific Hard-Fault interrupt occurs, then save all the above data in a global buffer instead, so you can view it after reaching the while (1).
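A minimal sketch of that global-buffer variant might look like the following. The names fault_record, g_fault_record and save_fault_frame are hypothetical, and only the stacked frame is captured here; on the target you would copy the fault status registers (CFSR, HFSR and so on) into extra fields in the same way.

```c
#include <assert.h>
#include <stddef.h>

static_assert(sizeof(unsigned int) == 4, "registers are assumed to be 32 bits wide");

/* Copy of the stacked exception frame, in the order the core pushes it. */
struct fault_record {
    unsigned int r0, r1, r2, r3, r12, lr, pc, psr;
    int valid; /* set to 1 once the record has been filled in */
};

/* volatile so the stores are not optimised away before the final while (1). */
volatile struct fault_record g_fault_record;

/* Call this from the fault handler instead of printf, then spin. */
void save_fault_frame(const unsigned int *frame) {
    if (frame == NULL)
        return; /* nothing to record */
    g_fault_record.r0  = frame[0];
    g_fault_record.r1  = frame[1];
    g_fault_record.r2  = frame[2];
    g_fault_record.r3  = frame[3];
    g_fault_record.r12 = frame[4];
    g_fault_record.lr  = frame[5];
    g_fault_record.pc  = frame[6];
    g_fault_record.psr = frame[7];
    g_fault_record.valid = 1;
}
```

Once execution is parked in the while (1), g_fault_record can be inspected in the debugger's watch window. Keep in mind that for an imprecise fault the stacked PC points somewhere after the offending store, not at it.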
Here is the complete description of how to connect this ISR to the interrupt vector (although, as I understand from your question, you already have it implemented):
Jumping from one firmware to another in MCU internal FLASH
You might be able to find additional information on top of what you already know at:
http://www.keil.com/appnotes/files/apnt209.pdf

VHDL Full adder test bench output U

I'm a VHDL newbie. I wrote code for a full adder using only AND and OR gates. I created a testbench to test my code and configured it to stimulate all eight logic combinations of A, B and Cin. When I run the simulation, the waveforms are correct for the inputs, but the sum output (H in my case) shows just "U". Any ideas, please?
library ieee;
use ieee.std_logic_1164.all;
--This program describes behaviour of a full adder constructed using
-- AND and OR gates. It employs component method which describes behaviour of
--AND,OR and NOT gates and use them to build the final object
--Entity description of the full adder
entity full_add is
    port (a,b,c:IN BIT; h:OUT BIT);
end entity full_add;
--Description the 3 input And Gate
entity And2 is
    port (j,k,l:in BIT; m:out BIT);
end entity And2;
architecture ex1 of And2 is
begin
    m <= (j AND k AND l);
end architecture ex1;
--Description of the four input OR gate
entity Or2 is
    port (d,e,f,g:IN BIT; h:OUT BIT);
end entity Or2;
architecture ex1 of Or2 is
begin
    h <= (d or e or f or g);
end architecture ex1;
--Description of the NOT gate
entity Not1 is
    port(x:in BIT; y:out BIT);
end entity Not1;
architecture ex1 of Not1 is
begin
    y <= not x;
end architecture ex1;
--Components and wiring description
architecture netlist of full_add is
    signal s,u,v,s2,u2,v2,w2:BIT;
begin
    g1:entity WORK.Not1(ex1) port map(a,s);
    g2:entity WORK.Not1(ex1) port map(b,u);
    g3:entity WORK.Not1(ex1) port map(c,v);
    g4:entity WORK.And2(ex1) port map(s,u,c,s2);
    g5:entity WORK.And2(ex1) port map(s,b,v,u2);
    g6:entity WORK.And2(ex1) port map(a,u,v,v2);
    g7:entity WORK.And2(ex1) port map(a,b,v,w2);
    g8:entity WORK.Or2(ex1) port map (s2,u2,v2,w2,h);
end architecture netlist;
You are going to have to debug the implementation : this will be quite a good exercise in using the simulator!
You see that "H" has value 'U' but you don't yet know why. Trace all the drivers of H back : in the posted code I can only see one, which is derived from inputs S2, U2, V2, W2.
In the simulator, add these signals to the Wave window and re-run the simulation : is one of them stuck at 'U'? If so, trace that signal back to find out why. If they all have valid values, that points to something else driving 'U' onto signal H.
Learn the "Drivers" command for your simulator (i.e. find and read the manual!) to identify all the signals driving H, and their values at a given time. There may be a driver in the (hidden to us) testbench. If so, change its driving value (perhaps with an "H <= 'Z';" assignment) and re-run the simulation.
Essentially : learn and practice basic simulation debugging skills : one of these skills is likely to resolve the problem. And edit the question with what you have learned from them : if the problem persists, these results will point closer to it.
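When tracing S2, U2, V2 and W2 in the Wave window, it helps to know what values to expect. Here is a small C model of the posted netlist, with hypothetical names, that gives reference values for any input combination. It is only a sketch for comparison and does not by itself explain the 'U'. One thing it does show: g7 is wired to (a,b,v) in the posted code, so the model disagrees with a XOR b XOR c whenever a and b are both 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Intermediate and output signals of the posted full_add netlist. */
struct fa_trace { bool s2, u2, v2, w2, h; };

/* Evaluate the netlist exactly as wired in the question (g1..g8). */
struct fa_trace full_add_netlist(bool a, bool b, bool c) {
    bool s = !a, u = !b, v = !c;          /* g1, g2, g3: NOT gates */
    struct fa_trace t;
    t.s2 = s && u && c;                   /* g4 */
    t.u2 = s && b && v;                   /* g5 */
    t.v2 = a && u && v;                   /* g6 */
    t.w2 = a && b && v;                   /* g7, uses v (not c) as posted */
    t.h  = t.s2 || t.u2 || t.v2 || t.w2;  /* g8: 4-input OR */
    return t;
}

/* Reference sum bit of a full adder. */
bool full_add_sum(bool a, bool b, bool c) { return a ^ b ^ c; }
```

Comparing full_add_netlist(a, b, c).h with full_add_sum(a, b, c) over all eight input combinations shows which product term to look at once the 'U' itself is resolved.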

Error in assigning values to bytes in a 2D array of registers in Verilog

Hi, when I write this piece of code:
module memo(out1);
    reg [3:0] mem [2:0];
    output wire [3:0] out1;
    initial
    begin
        mem[0][3:0]=4'b0000;
        mem[1][3:0]=4'b1000;
        mem[2][3:0]=4'b1010;
    end
    assign out1= mem[1];
endmodule
I get the following warnings, which make the code unsynthesizable:
WARNING:Xst:1780 - Signal mem<2> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
WARNING:Xst:653 - Signal mem<1> is used but never assigned. This sourceless signal will be automatically connected to value 1000.
WARNING:Xst:1780 - Signal mem<0> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
Why am I getting these warnings?
Haven't I assigned the values of mem[0], mem[1] and mem[2]? Thanks for your help!
Your module has no inputs and a single output, out1. I'm not totally sure what the point of the module is with respect to your larger system, but you're basically initializing mem and then only using mem[1]. You could equivalently have a module which just assigns out1 the value 4'b1000 (mem never changes). So yes, you did initialize the array, but because you didn't use any of the other values, the Xilinx tools are optimizing your module during synthesis and "trimming the fat." If you were to simulate this module (say in ModelSim) you'd see your initializations just fine. Based on your warnings, though, I'm not sure why you've come to the conclusion that your code is unsynthesizable. It appears to me that you could definitely synthesize it, but that it's just sort of a weird way to assign a single value of 4'b1000.
With regard to using initial blocks to store values in block RAM (e.g. to make a ROM), that's fine. I've done that several times without issue. A common use for this is to store coefficients in block RAM, which are read out later. That said, the way this module is written there's no way to read anything out of mem anyway.