Output of arbiter not stable - chisel

I have just run into this problem.
Suppose I use an Arbiter to arbitrate access to a bus driver among multiple parallel transaction initiators. The bus and the initiators use DecoupledIO. It is known that Arbiter prioritizes in(0) over in(1). Consider this case:
clock 1: in(0).valid = 0, in(1).valid = 1 -> out === in(1) out.valid = 1 out.ready = 0
clock 2: in(0).valid = 1, in(1).valid = 1 -> out === in(0) out.valid = 1 out.ready = 1
So both clock 1 and 2 have bus.valid === 1
If a client on this bus cannot respond in the same cycle but only in the next cycle,
the out.ready driven by this client in clock 2 actually corresponds to in(1), NOT in(0).
I would expect the arbiter to choose in(0) if in(0) and in(1) become valid at the same clock cycle, but if in(1) turns valid before in(0), the arbiter keeps selecting in(1) until in(1) is fired.
In this case, LockingArbiter and RRArbiter have the same behaviour: a higher-priority input can always preempt a lower-priority input before the lower-priority input is locked (when count == 1, there is no lock at all).
I am kind of seeing this non-stable output as a bug-like issue of Arbiter.
Is there a work-around for this?

A "needsHold" parameter is added to all arbiters to enable this hold requirement. This feature is disabled by default. This is included in Chisel by commit 18ecaf8de4a5.
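To make the difference concrete, here is a small behavioural model in Python. It is not Chisel and it is not the code from the commit mentioned above; it is only a sketch of the two behaviours discussed in the question, assuming a fixed-priority pick and a "hold until fired" rule. The names priority_grant and HoldingArbiter are made up for this illustration.

```python
def priority_grant(valids):
    """Return the index of the lowest-numbered valid input, or None."""
    for i, v in enumerate(valids):
        if v:
            return i
    return None


class HoldingArbiter:
    """Assumed hold rule: once an input is chosen, keep it until it fires."""

    def __init__(self):
        self.held = None  # index of the input chosen but not yet fired

    def step(self, valids, out_ready):
        # Keep the previously chosen input while it is still valid.
        if self.held is not None and valids[self.held]:
            chosen = self.held
        else:
            chosen = priority_grant(valids)
        fired = chosen is not None and out_ready
        # Release the hold when the transfer happens or nothing is chosen.
        self.held = None if (fired or chosen is None) else chosen
        return chosen


# The two clock cycles from the question: (valids, out.ready)
cycles = [([False, True], False), ([True, True], True)]

plain = [priority_grant(v) for v, _ in cycles]
arb = HoldingArbiter()
held = [arb.step(v, r) for v, r in cycles]
print("plain priority:", plain)
print("with hold:     ", held)
```

With the plain priority pick, the selected input switches from in(1) to in(0) between the two cycles; with the hold rule, in(1) stays selected until it fires.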

Format number with variable amount of significant figures depending on size

I've got a little function that displays a formatted amount of some number value. The intention is to show a "commonsense" amount of significant figures depending on the size of the number. So for instance, 1,234 comes out as 1.2k while 12,345 comes out as 12k and 123,456 comes out as 123k.
So in other words, I want to show a single decimal when on the lower end of a given order of magnitude, but not for larger values where it would just be useless noise.
I need this function to scale all the way from 1 to a few billion. The current solution is just to branch it:
-- given `current`
local text = (
current > 9,999,999,999 and ('%dB') :format(current/1,000,000,000) or
current > 999,999,999 and ('%.1fB'):format(current/1,000,000,000) or
current > 9,999,999 and ('%dM') :format(current/1,000,000) or
current > 999,999 and ('%.1fM'):format(current/1,000,000) or
current > 9,999 and ('%dk') :format(current/1,000) or
current > 999 and ('%.1fk'):format(current/1,000) or
('%d'):format(current) -- show values < 1000 floored
)
textobject:SetText(text)
-- code formatted for readability
Which I feel is very ugly. Is there some elegant formula for rounding numbers in this fashion without just adding another (two) clauses for every factor of 1000 larger I need to support?
I didn't realize how simple this actually was until a friend gave me a solution (which checked the magnitude of the number based on its length). I converted that to use log to find the magnitude, and now have an elegant working answer:
local suf = {'k','M','B','T'}
local function clean_format(val)
  if val == 0 then return '0' end -- *Edit*: Fix an error caused by attempting to get log10(0)
  local m = math.min(#suf,math.floor(math.log10(val)/3)) -- find the magnitude, or use the max magnitude we 'understand'
  local n = val / 1000 ^ m -- calculate the displayed value
  local fmt = (m == 0 or n >= 10) and '%d%s' or '%.1f%s' -- and choose whether to apply a decimal place based on its size and magnitude
  return fmt:format(n,suf[m] or '')
end
Scaling it up to support a greater factor of 1000 is as easy as putting the next entry in the suf array.
Note: for language-agnostic purposes, Lua arrays are 1-based, not zero based. The above solution would present an off-by-one error in many other languages.
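To illustrate the note about 1-based arrays, here is a rough Python port of the same idea. It is only a sketch: the empty string at index 0 of SUFFIXES is my way of absorbing the off-by-one, and the guard for values below 1 is an assumption (the Lua version only guards against 0).

```python
import math

# Index 0 means "no suffix", so the magnitude can index the list directly.
SUFFIXES = ['', 'k', 'M', 'B', 'T']


def clean_format(val):
    if val < 1:
        # log10 is undefined for 0 and negative for fractions; show a plain integer.
        return str(int(val))
    # Magnitude in steps of 1000, capped at the largest suffix we know.
    m = min(len(SUFFIXES) - 1, math.floor(math.log10(val) / 3))
    n = val / 1000 ** m
    if m == 0 or n >= 10:
        return f"{int(n)}{SUFFIXES[m]}"   # no decimal for larger values
    return f"{n:.1f}{SUFFIXES[m]}"        # one decimal at the low end


for value in (999, 1234, 12345, 123456, 2_500_000_000):
    print(value, "->", clean_format(value))
```

Adding a larger magnitude is still just a matter of appending another suffix to the list.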
Put your ranges and their suffixes inside a table.
local multipliers = {
{10^10, 'B', 10^9},
{10^9, 'B', 10^9, true},
{10^7, 'M', 10^6},
{10^6, 'M', 10^6, true},
{10^4, 'k', 10^3},
{10^3, 'k', 10^3, true},
{1, '', 1},
}
The optional true value at the 4th position of every other entry selects the %.1f placeholder instead of %d. The third index is the divisor.
Now, iterate over this table (using ipairs) and format accordingly:
function MyFormatter( current )
  for i, t in ipairs( multipliers ) do
    if current >= t[1] then
      local sHold = (t[4] and "%.1f" or "%d")..t[2]
      return sHold:format( current/t[3] )
    end
  end
end

Outputting a bitstream onto a pin in verilog

I need to output a 32-bit bitstream onto a pin in Verilog. I know Verilog has the streaming operators pack and unpack, but I do not believe they will do what I want.
I have a 32x512 FIFO RAM in which data is stored. Data for the variable "I" is stored in the first 32 bits and the data for variable "Q" is stored in the next 32 bits (the rest of the FIFO stores data in this alternating fashion). I need to continually read a 32-bit word off the FIFO RAM and output it as a bitstream onto a pin. My FIFO has three output signals: the 32-bit data word (32_data), a signal that says when the FIFO is empty (32_empty), and a signal that says when the FIFO is full (32_full). My pseudocode is the following (it's pseudocode because I know how to do everything except the part I need help with, and I wanted to keep it simple):
process @ posedge clock
begin
if (32_empty != 1) then //if the FIFO has data
if (32_full == 1) then //if the FIFO is full, then we lose data (for testing purposes to know if I need to make the RAM bigger
PIN_1 <= 1; //output onto a pin that the FIFO is full
PIN_2 <= 0; //clear pin 2 from outputting data for "I"
PIN_3 <= 0; //clear pin 3 from outputting data for "Q"
else if (en_Q == 0)
(stream 32bit data for variable "I" onto pin 2) //variable "I" output//HELP-This is where I need help figuring out how to stream the output, 32_data, onto a pin
en_Q <= ~en_Q; // toggle en_Q so next 32bit stream will be for "Q"
else if (en_Q ==1)
(stream 32bit data for variable "Q" onto pin 3) //variable "Q" output//HELP-This is where I need help figuring out how to stream the output, 32_data, onto a pin
en_Q <= ~en_Q; // toggle en_Q so next 32bit stream will be for "I"
end
If you could help me with figuring out how to stream a 32 bit data stream onto a pin, that would be great!
Thanks in advance
I have added the suggestion. Could I put the data on the pins with a for loop? The following is my code segment and the bottom part is the shift register and outputting to a pin:
`// Wires and registers related to data capturing
wire capture_clk;
reg [31:0] capture_data;
wire capture_en;
reg [4:0] slowdown;
wire capture_full;
reg capture_open;
reg capture_open_cross;
reg capture_has_been_full;
reg capture_has_been_nonfull;
reg has_been_full_cross;
reg has_been_full;
// Data capture section
// ====================
always @(posedge capture_clk)
begin
if (capture_en)
capture_data <= user_w_write_32_data; // Bogus data source
// The slowdown register limits the data pace to 1/32 the bus_clk
// when capture_clk = bus_clk. This is necessary, because the
// core in the evaluation kit is configured for simplicity, and
// not for performance. Sustained data rates of 200 MB/sec are
// easily reached with performance-oriented setting.
// The slowdown register has no function in a real-life application.
slowdown <= slowdown + 1;
// capture_has_been_full remembers that the FIFO has been full
// until the file is closed. capture_has_been_nonfull prevents
// capture_has_been_full to respond to the initial full condition
// every FIFO displays on reset.
if (!capture_full)
capture_has_been_nonfull <= 1;
else if (!capture_open)
capture_has_been_nonfull <= 0;
if (capture_full && capture_has_been_nonfull)
capture_has_been_full <= 1;
else if (!capture_open)
capture_has_been_full <= 0;
end
// The dependency on slowdown is only for bogus data
assign capture_en = capture_open && !capture_full &&
!capture_has_been_full &&
(slowdown == 0);
// Clock crossing logic: bus_clk -> capture_clk
always @(posedge capture_clk)
begin
capture_open_cross <= user_r_read_32_open;
capture_open <= capture_open_cross;
end
// Clock crossing logic: capture_clk -> bus_clk
always @(posedge bus_clk)
begin
has_been_full_cross <= capture_has_been_full;
has_been_full <= has_been_full_cross;
end
// The user_r_read_32_eof signal is required to go from '0' to '1' only on
// a clock cycle following an asserted read enable, according to Xillybus'
// core API. This is assured, since it's a logical AND between
// user_r_read_32_empty and has_been_full. has_been_full goes high when the
// FIFO is full, so it's guaranteed that user_r_read_32_empty is low when
// that happens. On the other hand, user_r_read_32_empty is a FIFO's empty
// signal, which naturally meets the requirement.
assign user_r_read_32_eof = user_r_read_32_empty && has_been_full;
assign user_w_write_32_full = 0;
// The data capture clock here is bus_clk for simplicity, but clock domain
// crossing is done properly, so capture_clk can be an independent clock
// without any other changes.
assign capture_clk = bus_clk;
async_fifo_32x512 fifo_32 //FIFO created using Xilinx FIFO Generator Wizard
(
.rst(!user_r_read_32_open),
.wr_clk(capture_clk),
.rd_clk(bus_clk),
.din(capture_data),
.wr_en(capture_en),
.rd_en(user_r_read_32_rden),
.dout(user_r_read_32_data),
.full(capture_full),
.empty(user_r_read_32_empty)
);
reg Q_en = 1'b0; //starting value is 0 because first 32bit is I
reg [31:0] data_outI;
reg [31:0] data_outQ;
reg I;
reg Q;
integer counter;
always @(posedge bus_clk) begin
if(Q_en == 1'b0) begin //To output to I signal
data_outI <= user_r_read_32_data;
for (counter = 0; counter < 32; counter = counter + 1) begin //output to pins
I = data_outI[0];
data_outI <= (data_outI >> 1);
Q = data_outQ[0];
data_outQ <= (data_outQ >> 1);
end
Q_en <= ~Q_en;
end
else if(Q_en == 1'b1) begin //To output to Q signal
data_outQ <= user_r_read_32_data;
for (counter = 0; counter < 32; counter = counter + 1) begin //output to pins
I = data_outI[0];
data_outI <= (data_outI >> 1);
Q = data_outQ[0];
data_outQ <= (data_outQ >> 1);
end
Q_en <= ~Q_en;
end
end
assign PS_GPIO_ONE_I = I; //Assign Pin I
assign PS_GPIO_TWO_Q = Q; //Assign Pin Q
`
Basically, you'd want to do something like the following:
Fetch an item from the FIFO into a 32-bit register (let's call it data).
Each clock cycle, put the LSB of data onto the pin, and then right-shift data by one bit.
Keep repeating this shifting for 32 clock cycles, until all of the data has been shifted out.
Toggle the value of en_Q, and fetch another 32 bit item.
You should be able to make a small state machine that can handle this sequence. You can't shift out all 32 bits in a single clock cycle as you have done in your pseudo-code, unless you also have a 32x clock available, and that would likely be a more complicated design.
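The sequence above can be sketched as a behavioural model in Python before writing the HDL. This is not Verilog and not a drop-in solution; it only shows the one-bit-per-clock ordering. The function name serialize and the choice to drive the idle pin low are assumptions made for the illustration.

```python
def serialize(words):
    """Yield (pin_i, pin_q) once per simulated clock cycle.

    `words` alternates I, Q, I, Q, ... as read from the FIFO.
    Each 32-bit word takes 32 cycles, least significant bit first.
    """
    en_q = False                      # False: next word is I, True: next word is Q
    for word in words:
        data = word & 0xFFFFFFFF      # model a 32-bit register
        for _ in range(32):
            bit = data & 1            # LSB goes to the pin this cycle
            yield (0, bit) if en_q else (bit, 0)
            data >>= 1                # shift right by one bit
        en_q = not en_q               # toggle after all 32 bits are out


cycles = list(serialize([0x00000001, 0x80000000]))
print(len(cycles), "clock cycles for two words")
```

In hardware, the inner loop becomes a 5-bit counter plus a shift register inside a clocked block, and the outer loop becomes the FIFO read that happens whenever the counter wraps.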

Shared Memory in cuda fortran not working as expected

I am building a CUDA Fortran program and it behaves strangely. I don't really understand why my code runs like this and would appreciate your help.
It seems that the value 0 is never assigned, and the loops even
execute beyond the array bounds.
I tried putting the if condition after the loops, but that did not help either.
Thank you for your help.
real, shared :: s_d_aaa_adk(0:15,0:15)
real, shared :: s_d_bbb_adk(0:15,0:15)
real, shared :: s_d_ccc_adk(0:15,0:15)
d_k = (blockIdx%x-1)
s_d_j = threadIdx%x-1
s_d_l = threadIdx%y-1
if(d_k == kmax-1)then
s_d_aaa_adk(s_d_j,s_d_l) = 0
s_d_bbb_adk(s_d_j,s_d_l) = 0
s_d_ccc_adk(s_d_j,s_d_l) = 0
endif
do d_k = 0, kmax-2
s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,d_l,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(d_j,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(d_j,s_d_l,d_k+1)
end do
I set all global memory arrays to size (16,16,kmax),
the grid is (128,1,1), the block is (16,16,1), and
the kernel is launched as testkernell<<<grid,block>>>()
Since you're conditioning the if statement on d_k, which is derived from the block index:
d_k = (blockIdx%x-1)
if(d_k == kmax-1)then
This means that only one block out of the 128 in your grid will actually execute the if statement, setting those particular shared memory values to zero. Most of your blocks will not execute what's inside the if statement.
And if kmax happens to be greater than 128, then none of your blocks will execute the if statement.
If you want that if-statement to be executed within every threadblock, you will need to condition it on something other than the block index.
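A quick way to convince yourself is to enumerate the blocks on the host side. The following Python sketch just evaluates the same condition for every block index; blocks_running_if is a made-up helper name, and the kmax values are example inputs, not taken from your code.

```python
def blocks_running_if(grid_x, kmax):
    """List the (1-based) blockIdx%x values for which d_k == kmax-1 holds."""
    # d_k = blockIdx%x - 1, exactly as in the kernel
    return [b for b in range(1, grid_x + 1) if (b - 1) == kmax - 1]


print(blocks_running_if(128, 32))    # kmax within the grid size
print(blocks_running_if(128, 200))   # kmax larger than the grid size
```

At most one block index can ever satisfy the condition, so at most one thread block zeroes its shared memory.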
I would make a suggestion about how to restructure the code, but it's not clear to me what you want to achieve as far as loading data into shared memory. For instance, your do-loop doesn't make much sense to me:
do d_k = 0, kmax-2
s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,d_l,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(d_j,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(d_j,s_d_l,d_k+1)
end do ^ ^
| |
a given thread has specific values for these indices
Your s_d_j and s_d_l variables are thread indices. So a given thread will see this do loop, and it will execute the loop iteratively, loading successive values from the various global memory arrays (d_bbb, d_ccc, etc.) into the exact same locations in each shared memory array.
It seems to me you don't really understand how thread execution works. Pretend that you are a given thread, assign specific values to s_d_j and s_d_l (and d_k, although you overwrite the block index when you re-use that variable as your loop index, which also seems strange to me), and then see if your code execution makes sense.
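Here is that "pretend you are one thread" exercise written out as a Python sketch. It models a single shared-memory cell and one column of a global array for fixed thread indices; the function name and the sample data are invented for the illustration.

```python
def shared_value_after_loop(global_column, kmax):
    """Model one thread running: do d_k = 0, kmax-2 ... end do.

    global_column[k] stands for d_bbb(j, l, k+1) at this thread's fixed (j, l),
    i.e. Python index k corresponds to Fortran index k+1.
    """
    shared = None
    for d_k in range(0, kmax - 1):       # d_k = 0 .. kmax-2
        shared = global_column[d_k]      # same shared cell written every time
    return shared


column = [10, 20, 30, 40]                # pretend values along the third dimension
print(shared_value_after_loop(column, kmax=4))
```

Every iteration writes to the same shared location, so only the value from the last iteration survives.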
EDIT: Based on additional comments:
You have stated your overall data set size (x,y,z) is (64,64,32).
You have stated "I am slicing ... array through z. ... I want to put each slice in one block"
That would suggest to me that you should launch one block per slice. Or maybe you have an algorithm in mind that has multiple blocks assigned to a single slice. Regardless, I will assume that you want all the slice data (64, 64) available to a given block that is assigned to that slice. I will assume for now that you will launch 32 blocks. It should not be difficult to extend to the case where multiple blocks are working on a single slice. I will also assume a 32x32 thread block rather than 16x16 that you have indicated. It should not be difficult to extend this to use 16x16 if you want to.
You might do something like this then:
real, shared :: s_d_aaa_adk(0:63,0:63)
real, shared :: s_d_bbb_adk(0:63,0:63)
real, shared :: s_d_ccc_adk(0:63,0:63)
c above uses 48KB of shared mem, so assuming cc 2.0+ and cache config set accordingly
d_k = (blockIdx%x-1)
s_d_j = threadIdx%x-1
s_d_l = threadIdx%y-1
c fill first quadrant
s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,s_d_l,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(s_d_j,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(s_d_j,s_d_l,d_k+1)
c fill second quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l) = d_bbb(s_d_j+blockDim%x,s_d_l,d_k+1)
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l) = d_ccc(s_d_j+blockDim%x,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l) = d_aaa(s_d_j+blockDim%x,s_d_l,d_k+1)
c fill third quadrant
s_d_bbb_adk(s_d_j,s_d_l+blockDim%y) = d_bbb(s_d_j,s_d_l+blockDim%y,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l+blockDim%y) = d_ccc(s_d_j,s_d_l+blockDim%y,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l+blockDim%y) = d_aaa(s_d_j,s_d_l+blockDim%y,d_k+1)
c fill fourth quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_bbb(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_ccc(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_aaa(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
c just guessing about what your intent was on filling with zeroes
c this just makes sure that one of the slices at the end gets zeroes
c instead of the values from the global arrays
if(d_k == kmax-1)then
c fill first quadrant
s_d_bbb_adk(s_d_j,s_d_l) = 0
s_d_ccc_adk(s_d_j,s_d_l) = 0
s_d_aaa_adk(s_d_j,s_d_l) = 0
c fill second quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l) = 0
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l) = 0
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l) = 0
c fill third quadrant
s_d_bbb_adk(s_d_j,s_d_l+blockDim%y) = 0
s_d_ccc_adk(s_d_j,s_d_l+blockDim%y) = 0
s_d_aaa_adk(s_d_j,s_d_l+blockDim%y) = 0
c fill fourth quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
endif

Matlab Newbie Binary Search Troubleshoot

I am a newbie to Matlab and to programming in general. I wish to write a script that uses recursive binary search to approximate the root of $2x - 3\sin(x) + 5 = 0$, such that the iteration terminates once the truncation error is definitely $< 0.5 \times 10^{-5}$, and that prints the number of iterations as well as the estimate of the root.
Here is my attempt that seems to have broken my computer...
%Approximating the root of f(x) = 2*x - 3*sin(x) + 5 by binary search
%Define variables
low = input('Enter lower bound of range: ');
high = input('Enter upper bound of range: ');
mid = (low + high)/2;
%Define f_low & f_high
f_low = 2*low - 3*sin(low) + 5;
f_high = 2*high - 3*sin(high) + 5;
f_mid = 2*mid - 3*sin(mid) + 5;
%Check that the entered range contains the key
while (f_low * f_high) > 0 || low > high
disp('Invalid range')
low = input('Enter lower bound of range: ');
high = input('Enter upper bound of range: ');
end
%The new range
while abs(f_mid) > 0.5*10^(-5)
if f_mid < 0
low = mid;
elseif f_mid > 0
high = mid;
end
end
fprintf('mid = %.4f \n', mid)
I haven't even added in the number-of-iterations counting bit (which I am not quite sure how to do) and already I am stuck.
Thanks for any help.
Once you set high=mid or low=mid, are mid and f_mid recalculated? It also looks like you will fail if f_low>0 and f_high<0. This is a valid condition, but in that case you are resetting the wrong bound. Also, your termination check is on the function value, not on the difference between low and high. This may be what you want, or maybe you want to check both. For very flat functions you may not be able to get the function value that small.
You don't need to compute f_mid up front, and it is in fact misleading you. You just need to calculate the value at each step and see which direction to go.
Plus, you are just changing low and high, but you do not evaluate again f_low or f_high. Matlab is not an algebra system (there are modules for symbolic computation, but that's a different story), so you did not define f_low and f_high to change with the change of low and high: you have to reevaluate them in your final loop.
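Putting these points together, here is a sketch of the loop in Python rather than Matlab (it translates almost line for line). It is iterative, not recursive, it stops on the interval half-width instead of the function value, and it counts iterations. The bracket [-3, -2] is my own choice of starting range, not something from the question.

```python
import math


def f(x):
    return 2 * x - 3 * math.sin(x) + 5


def bisect(func, low, high, tol=0.5e-5):
    """Bisection on [low, high]; returns (estimate, iterations)."""
    f_low = func(low)
    if f_low * func(high) > 0:
        raise ValueError("f(low) and f(high) must have opposite signs")
    iterations = 0
    # The midpoint is within (high - low) / 2 of the true root.
    while (high - low) / 2 >= tol:
        mid = (low + high) / 2        # recomputed on every pass
        f_mid = func(mid)             # re-evaluated on every pass
        iterations += 1
        if f_mid == 0:
            low = high = mid          # landed exactly on the root
        elif f_low * f_mid < 0:
            high = mid                # sign change is in the lower half
        else:
            low, f_low = mid, f_mid   # sign change is in the upper half
    return (low + high) / 2, iterations


root, iterations = bisect(f, -3.0, -2.0)
print(f"root = {root:.5f} after {iterations} iterations")
```

Comparing signs through the product f_low * f_mid works whether the function is rising or falling across the bracket, which covers the f_low>0, f_high<0 case as well.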

most readable programming language to simulate 10,000 chutes and ladders game plays?

I'm wondering what language would be most suitable to simulate the game Chutes and Ladders (Snakes and Ladders in some countries). I'm looking to collect basic stats, like average and standard deviation of game length (in turns), probability of winning based on turn order (who plays first, second, etc.), and anything else of interest you can think of. Specifically, I'm looking for the implementation that is most readable, maintainable, and modifiable. It also needs to be very brief.
If you're a grown-up and don't spend much time around young kids then you probably don't remember the game that well. I'll remind you:
There are 100 squares on the board.
Each player takes a turn spinning a random number from 1-6 (or throwing a die).
The player then advances that many squares.
Some squares are at the base of a ladder; landing on one of these squares means the player gets to climb the ladder, advancing the player's position to a predetermined square.
Some squares are at the top of a slide (chute or snake); landing on one of these squares means the player must slide down, moving the player's position back to a predetermined square.
Whichever player gets to position 100 first is the winner.
This is a bit rough, but it should work:
class Board
  attr_accessor :winner
  def initialize(players, &blk)
    @chutes, @ladders = {}, {}
    @players = players
    @move = 0
    @player_locations = Hash.new(0)
    self.instance_eval(&blk)
  end
  def chute(location)
    @chutes[location[:from]] = location[:to]
  end
  def ladder(location)
    @ladders[location[:from]] = location[:to]
  end
  def spin
    player = @move % @players
    die = rand(6) + 1
    location = (@player_locations[player] += die)
    # a square is either the top of a chute or the base of a ladder
    if endpoint = @chutes[location] || @ladders[location]
      @player_locations[player] = endpoint
    end
    if @player_locations[player] >= 100
      @winner = player
    end
    @move += 1
  end
end
num_players = 4
board = Board.new(num_players) do
  ladder :from => 4, :to => 14
  ladder :from => 9, :to => 31
  # etc.
  chute :from => 16, :to => 6
  # etc.
end
until board.winner
  board.spin
end
puts "Player #{board.winner} is the winner!"
You should check out something along the lines of Ruby or Python. Both are basically executable pseudocode.
You might be able to get a shorter, more brilliant program with Haskell, but I would imagine Ruby or Python would probably be actually understandable.
For many statistics, you don't need to simulate. Using Markov Chains, you can reduce many problems to matrix operations on a 100x100-matrix, which only take about 1 millisecond to compute.
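As a rough illustration of the Markov chain idea, the sketch below computes the expected number of turns for a single player exactly, without simulation. Instead of inverting a matrix it repeatedly pushes the probability distribution over squares through one turn, which is the same chain applied step by step. The function name, the jumps dictionary format, and the rule that overshooting the last square counts as finishing are all assumptions for this example.

```python
def expected_turns(board_size, jumps, sides=6, max_turns=10_000):
    """Expected turns for one player to reach square `board_size`.

    `jumps` maps a landing square to its destination (chute or ladder).
    Uses E[T] = sum over t >= 0 of P(game not finished after t turns).
    """
    dist = [0.0] * (board_size + 1)   # dist[s] = probability of being on square s
    dist[0] = 1.0                     # everyone starts off the board at 0
    expected = 0.0
    for _ in range(max_turns):
        unfinished = 1.0 - dist[board_size]
        if unfinished < 1e-12:
            break
        expected += unfinished
        new = [0.0] * (board_size + 1)
        new[board_size] = dist[board_size]        # finished stays finished
        for pos in range(board_size):
            p = dist[pos]
            if p == 0.0:
                continue
            for roll in range(1, sides + 1):
                target = min(pos + roll, board_size)  # overshoot counts as a finish
                target = jumps.get(target, target)    # follow a chute or ladder
                new[target] += p / sides
        dist = new
    return expected


# Tiny made-up board: 10 squares, one ladder (3 -> 6) and one chute (5 -> 2).
print(expected_turns(10, {3: 6, 5: 2}))
```

For the real game you would pass board_size=100 and the actual chute and ladder positions; the loop body is cheap enough that this runs almost instantly.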
I'm going to disagree with some of the earlier posters, and say that an object oriented approach is the wrong thing to do here, as it makes things more complicated.
All you need is to track the position of each player, and a vector to represent the board. If the board position is empty of a chute or ladder, it is 0. If it contains a ladder, the board contains a positive number that indicates how many positions to move forward. If it contains a chute, it contains a negative number to move you back. Just track the number of turns and positions of each player.
The actual simulation with this method is quite simple, and you could do it in nearly any programming language. I would suggest R or python, but only because those are the ones I use most these days.
I don't have a copy of chutes and ladders, so I made up a small board. You'll have to put in the right board:
#!/usr/bin/env python3
import random
import numpy
# board[p] is the offset applied when a player lands on square p (0 = plain square)
board = [0, 0, 0, 3, 0, -3, 0, 1, 0, 0]
numplayers = 2
numruns = 100
def simgame(numplayers, board):
    winner = -1
    winpos = len(board)
    pos = [0] * numplayers
    turns = 0
    while max(pos) < winpos:
        turns += 1
        for i in range(numplayers):
            pos[i] += random.randint(1, 6)
            if pos[i] < winpos:
                pos[i] += board[pos[i]]
            if pos[i] >= winpos and winner == -1:
                winner = i
    return (turns, winner)
# simulate games, then extract turns and winners
games = [simgame(numplayers, board) for x in range(numruns)]
turns = [n for (n, w) in games]
winner = [w for (t, w) in games]
pwins = [len([p for p in winner if p == i]) for i in range(numplayers)]
print("runs:", numruns)
print("mean(turns):", numpy.mean(turns))
print("sd(turns):", numpy.std(turns))
for i in range(numplayers):
    print("Player", (i + 1), "won with proportion:", pwins[i] / numruns)
F# isn't too ugly either; it's hard to beat a functional language for conciseness:
#light
open System
let snakes_and_ladders = dict [(1,30);(2,5);(20,10);(50,11)]
// one shared generator, so consecutive rolls do not reuse the same seed
let rng = Random()
let roll_dice(sides) =
    rng.Next(sides) + 1
let turn(starting_position) =
    let new_pos = starting_position + roll_dice(6)
    let found, after_snake_or_ladder = snakes_and_ladders.TryGetValue(new_pos)
    if found then after_snake_or_ladder else new_pos
let mutable player_positions = [0;0]
while List.max player_positions < 100 do
    player_positions <- List.map turn player_positions
if player_positions.Head >= 100 then printfn "Player 1 wins" else printfn "Player 2 wins"
I remember, about 4 years ago, being in a TopCoder competition where the question was: what is the probability that player 1 wins at snakes and ladders?
There was a really good write-up how to do the puzzle posted after the match. Can't find it now but this write-up is quite good
C/C++ seemed enough to solve the problem well.
Pick any object oriented language, they were invented for simulation.
Since you want it to be brief (why?), pick a dynamically typed language such as Smalltalk, Ruby or Python.