single-cycle MIPS timing questions - mips

I am reading the book "Computer Organization and Design". In chapter 4, it describes a single-cycle MIPS machine; however, I have several doubts about it.
If the data memory and the instruction memory in the design are SRAMs, how can any instruction finish in a single clock cycle? Take a load instruction as an example: I think the single-cycle MIPS design still has to go through the following stages, with only the ID and EXE stages merged.
|  1  |  2  |   3    |  4  |
| WB  |     |        |     |
|     | IF  |        |     |
|     |     | ID/EXE |     |
|     |     |        | MEM |
If the data memory is updated at the negative clock edge, the ID, EXE, and MEM stages can be merged, but there are still three stages left.
Can anyone explain how the "single-cycle" design actually works? Thanks!

The single-cycle processor that you read about in chapter 4 is a slight oversimplification of what is implementable in reality; the book doesn't show some of the tricky details. For instance, one timing assumption you could make is that memory reads are combinational and memory writes take one positive edge to complete, i.e. similar to the register file. In that case, when the clock edge arrives, your IF stage is populated with an instruction. For the duration of that cycle you decode and execute the instruction, and writeback happens on the next clock edge. The same is true if the instruction is a data store: the memory will be written on the next clock edge. In the case of a load, you are assuming a combinational memory read, so the data will arrive before the clock edge, and on that edge it will be written to the register file.
Now, this is not the best way to implement this, and you have to make several assumptions. In a slightly more realistic unpipelined processor you could have a stall signal that rolls execution over to the next cycle while you are waiting for a memory request. So you can imagine having a Stall_on_IF signal and a Stall_on_LD signal that tell you to stall this cycle until your instruction/data arrives; when it does arrive, you latch it and continue execution in the next cycle.
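To make the first timing assumption concrete, here is a minimal C-style sketch of that single-cycle behaviour. It is only an illustration of the assumption described above, not the book's design; all names (imem, dmem, regs, step) are made up, and $zero handling, overflow checks, and the other instruction formats are omitted.

#include <stdint.h>

#define MEM_WORDS 1024

static uint32_t imem[MEM_WORDS];   /* instruction memory, read combinationally */
static uint32_t dmem[MEM_WORDS];   /* data memory: read combinationally, written on the edge */
static uint32_t regs[32];          /* register file, written on the edge */
static uint32_t pc;

/* One call == one clock cycle of the single-cycle machine. Everything up to
   the state updates "settles" combinationally within the cycle; the updates
   themselves model what becomes visible at the next clock edge. */
static void step(void)
{
    uint32_t instr = imem[pc >> 2];              /* IF: combinational read   */
    uint32_t op    = instr >> 26;                /* ID: decode fields        */
    uint32_t rs    = (instr >> 21) & 31;
    uint32_t rt    = (instr >> 16) & 31;
    int32_t  imm   = (int16_t)(instr & 0xFFFF);

    if (op == 0x23) {                            /* lw rt, imm(rs)           */
        uint32_t addr = regs[rs] + imm;          /* EX: address calculation  */
        uint32_t data = dmem[addr >> 2];         /* MEM: combinational read  */
        regs[rt] = data;                         /* WB: commits on the edge  */
    } else if (op == 0x2B) {                     /* sw rt, imm(rs)           */
        uint32_t addr = regs[rs] + imm;
        dmem[addr >> 2] = regs[rt];              /* write commits on the edge */
    }
    pc += 4;                                     /* PC update also on the edge */
}

Because the fetch and load are modelled as combinational and all state changes land together at the end of step(), every instruction really does fit in one (long) clock cycle; the cycle time just has to cover the slowest path (IF + decode + EX + MEM + setup time for the register write).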


Relation between warp scheduling and warp context switching in CUDA

As far as I understand, a ready warp is one that can be executed when warps are scheduled. A waiting warp is waiting for its source operands to be fetched or computed, so it cannot be executed. The warp scheduler chooses a ready warp to execute; this is "warp scheduling".
On the other hand, when a warp hits a pipeline stall or a long global memory latency, another warp is brought into execution to hide the latency. This is the basic idea of "warp context switching" in CUDA.
My question is: what is the relation between warp scheduling and warp context switching in CUDA? To elaborate, below is an example.
E.g. warp A is stalled, waiting for data to be fetched from global memory. Once the data is fetched, warp A is scheduled, or switched into the ready warp pool. Based on this, warp context switching is a part of warp scheduling. Is that correct?
Can anyone also provide references on warp context switching and warp scheduling in CUDA? It seems NVIDIA does not make these documents publicly available.
Thanks in advance for any reply.
The ready warps are those which can be scheduled on the next cycle. Stalled warps cannot be scheduled.
To answer the question about latency with an extremely simplified example, suppose that the latency to main memory is 8 execution cycles, and let's ignore the fact that the machine is pipelined. Let's assume all instructions can execute in one cycle, if the data is ready.
Now suppose I have C code like this:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int myval = global_data[idx]*global_data[idx];
That is, myval should contain the square of an item in global memory, when the code is complete. This will be decomposed into a sequence of assembly language instructions. Let's suppose they look something like this:
I0: R0 = global_data[idx];
I1: R1 = R0 * R0;
I2: ...
Every thread can execute the first line of code (initially there are no stalls); there is no dependency yet, and a read by itself does not cause a stall. However, when each thread moves on to the second line of code, the value of R0 must be correct, so a stall occurs while waiting for the read to be retrieved. As mentioned already, suppose the latency is 8 cycles; with a warp width of 32 and a threadblock size of 512, we have a total of 16 warps. Let's suppose for simplicity that we have a Fermi SM with only 32 execution units. The sequence will look something like this:
cycle   ready warps   executing warp   instruction executed   latency
  0     1-15               0           I0 -> I1 (stall)       --
  1     2-15               1           I0 -> I1 (stall)        |  --
  2     3-15               2           I0 -> I1 (stall)        |   |
  3     4-15               3           I0 -> I1 (stall)        |   |
  4     5-15               4           I0 -> I1 (stall)        |   |
  5     6-15               5           I0 -> I1 (stall)        |   |
  6     7-15               6           I0 -> I1 (stall)        |   |
  7     8-15               7           I0 -> I1 (stall)        |   |
  8     0,9-15             8           I0 -> I1 (stall)       <-   |
  9     1,9-15             0           I1 -> I2                   <-
(The last column tracks the 8-cycle load latency of warp 0 and warp 1; the arrow marks the cycle in which each load completes.)
What we see is that after the latency is fulfilled by executing instructions from other warps, a previously "stalled" warp will re-enter the ready warp pool, and it's possible for the scheduler to schedule that warp again (i.e. to do the multiply operation contained in I1) on the very next cycle after the stall condition is removed.
There is no contradiction between latency hiding and warp scheduling. They work together, for a code with sufficient work to do, to hide the latency associated with various operations, such as reading from global memory.
The above example is a simplification compared to actual behavior, but it adequately represents the concepts of latency hiding and warp scheduling, to demonstrate how warp scheduling, in the presence of "enough work to do", can hide latency.
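To tie this back to runnable code, here is a hedged CUDA sketch of the same example. The kernel body matches the two source lines discussed above, while the kernel name, the output array, and the launch shown in the comment are mine, not part of the original question.

__global__ void square_kernel(const int *global_data, int *out, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;      // I0: issue the global load
    if (idx < n) {
        int myval = global_data[idx] * global_data[idx];  // I1: needs R0, stalls until the load returns
        out[idx] = myval;                                 // I2
    }
}

// Usage sketch: 512 threads per block = 16 warps per block, as in the example.
// As long as each SM holds several such blocks, the scheduler always has other
// ready warps to issue while some warps wait on global memory:
//
//   square_kernel<<<(N + 511) / 512, 512>>>(d_in, d_out, N);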

Understanding Recursion and the Traversal of Stack Frames

This is not a homework question; I'm merely trying to understand the process for my own edification. As a computer science student, I have attended several lectures where the concept of recursion was discussed. However, the lecturer was slightly vague, in my opinion, regarding the concept of a stack frame and how the call stack is traversed in order to calculate the final value. The manner in which I currently envision the process is analogous to building a tree from the top down (pushing items onto the call stack, a last-in, first-out data structure), then climbing the newly constructed tree, whereupon the final value is obtained upon reaching the top. Perhaps the canonical example:
def fact(n):
    if n == 0:
        ans = 1
    else:
        ans = n * fact(n-1)
    return ans

value = fact(5)
print(value)
As indicated above, I think the call stack eventually resembles the following crudely drawn diagram:
+----------+
|    5     |
|    4     |
|    3     |
|    2     |
|    1     |
+----------+
Each number would be "enclosed" within a stack frame, and control now proceeds from the bottom (with the value of 1) to 2, then 3, etc. I'm not entirely certain where the operator resides in the process, though. Would I be mistaken in assuming an abstract syntax tree (AST) is involved at some point, or is a second stack present that contains the operator(s)?
Thanks for the help.
~Caitlin
Each call stack frame stores the arguments, the return address, and local variables. The code itself (not only the operator) is stored elsewhere; the same code is executed on different stack frames.
You can find some more information and visualization here: http://www.programmerinterview.com/index.php/recursion/explanation-of-recursion/
This question is more about how function calls work than about recursion. When a function is called, a frame is created and pushed onto the stack. The frame includes a pointer back into the calling code, so that the program knows where to return after the function call. The operator resides in the executable code, after the call point.
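To make this concrete, here is the same factorial in a C-style sketch (purely illustrative), with the frames for fact(3) traced in comments. The multiply is simply part of fact's code and runs in the caller's frame after the recursive call has returned, so no second stack of operators (and no AST at run time) is needed.

int fact(int n)
{
    if (n == 0)
        return 1;              /* deepest frame: nothing left pending       */
    return n * fact(n - 1);    /* the multiply runs here, in this frame,    */
}                              /* after fact(n - 1) has returned its value  */

/* Call stack while evaluating fact(3):

     push fact(3)  -> needs fact(2); "3 * _" is pending in this frame
     push fact(2)  -> needs fact(1); "2 * _" is pending in this frame
     push fact(1)  -> needs fact(0); "1 * _" is pending in this frame
     push fact(0)  -> returns 1 immediately
     pop  fact(1)  -> 1 * 1 = 1
     pop  fact(2)  -> 2 * 1 = 2
     pop  fact(3)  -> 3 * 2 = 6   (this is the value the original caller sees)
*/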

VHDL warning: PAR will not attempt to route this signal

I am learning VHDL and I am on a quest to implement my own FIFO buffer, but I have some problems. Since I want to deploy the code on a Xilinx Spartan-6 device, I am using the Xilinx WebPack ISE with the associated VHDL compiler, but I am getting very weird warnings:
WARNING:Par:288 - The signal Mram_buf_mem1_RAMD_D1_O has no load. PAR will not attempt to route this signal.
WARNING:Par:283 - There are 1 loadless signals in this design. This design will cause Bitgen to issue DRC warnings.
Here is my code:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity FIFO_buffer is
    generic ( BUFFER_SIZE : positive := 4;   -- # of words
              WORD_WIDTH  : positive := 8 ); -- # of bits per word
    port ( data_in  : in  STD_LOGIC_VECTOR (WORD_WIDTH - 1 downto 0);
           full     : out STD_LOGIC := '0';
           write    : in  STD_LOGIC;
           data_out : out STD_LOGIC_VECTOR (WORD_WIDTH - 1 downto 0);
           empty    : out STD_LOGIC := '1';
           read     : in  STD_LOGIC );
end FIFO_buffer;

architecture arch of FIFO_buffer is
    type ram_t is array (0 to BUFFER_SIZE - 1) of std_logic_vector(WORD_WIDTH - 1 downto 0);
    signal buf_mem   : ram_t := (others => (others => '0'));
    signal read_idx  : integer range 0 to BUFFER_SIZE - 1 := 0;
    signal write_idx : integer range 0 to BUFFER_SIZE - 1 := 0;
    signal buf_full  : std_logic := '0';
    signal buf_empty : std_logic := '0';
begin
    writing_data: process(write)
    begin
        if (rising_edge(write)) then
            if (buf_full = '0') then
                buf_mem(write_idx) <= data_in;
                write_idx <= write_idx + 1;
                if (write_idx = read_idx)
                    then buf_full <= '1';
                    else buf_full <= '0';
                end if;
            end if;
        end if;
    end process;

    reading_data: process(read)
    begin
        if (rising_edge(read)) then
            if (buf_empty = '0') then
                data_out <= buf_mem(read_idx);
                read_idx <= read_idx + 1;
                if (read_idx = write_idx)
                    then buf_empty <= '1';
                    else buf_empty <= '0';
                end if;
            end if;
        end if;
    end process;

    full  <= buf_full;
    empty <= buf_empty;
end arch;
The error seems to be caused by the data_out <= buf_mem(read_idx); line in the reading_data process. Could anyone explain to me the reason for the warning? (I know that my code has some functional problems, but that should not affect the reason for the warning)
P.S. Since I have the code here let me ask one more question. How unwise is it to have a component (such as that FIFO buffer) which is not synchronised with the global clock?
I'll address your second question first, i.e. " How unwise is it to have a component (such as that FIFO buffer) which is not synchronised with the global clock?"
It depends on your requirements. Usually, you should clock your components, so you have synchronous logic and no weird glitches caused by asynchronous paths.
However, consider what you did here. You have clocked your component: rising_edge(read) and rising_edge(write). You will find in your synthesis report the following:
Primitive and Black Box Usage:
------------------------------
<snip>
# Clock Buffers : 2
# BUFGP : 2
<snip>
Clock Information:
------------------
-----------------------------------+------------------------+-------+
Clock Signal | Clock buffer(FF name) | Load |
-----------------------------------+------------------------+-------+
read | BUFGP | 11 |
write | BUFGP | 6 |
-----------------------------------+------------------------+-------+
This is because you're not using a combinational process: read and write are being treated as clocks. This will lead to all kinds of problems. You mentioned a Xilinx Spartan-6; you will get the following message along the way (usually an ERROR), assuming you did not accidentally place read and write at an optimal IOB/BUFG site pair:
Place:1109 - A clock IOB / BUFGMUX clock component pair have been found
that are not placed at an optimal clock IOB / BUFGMUX site pair. The clock
IOB component <write> is placed at site <A5>. The corresponding BUFG
component <write_BUFGP/BUFG> is placed at site <BUFGMUX_X2Y9>. There is only
a select set of IOBs that can use the fast path to the Clocker buffer, and
they are not being used. You may want to analyze why this problem exists and
correct it.
What this message explains in great verbosity is the following. FPGAs have dedicated routing networks for clocks, which ensure low skew (check Xilinx UG382 for more). There are specific pins on the FPGA that can directly access this clock network: there, the IOB (I/O buffer) and the BUFG[MUX] ([multiplexed] global [clock] buffer) are close by, ensuring that the signal from the pin can be distributed very quickly across the whole FPGA using dedicated clocking resources. You can check the placement with the FPGA Editor. For instance, my write pin has to cross half the FPGA before it can be routed through a global clock buffer; that's a 3.878 ns delay in my case.
The same applies to read, of course. So you see this is a bad idea: you should use dedicated clocking resources for your clocks and synchronize inputs and outputs to them.
Now, on to your main question. You have to be aware of what your HDL actually describes.
You have two distinct processes, each with their own clock (read; write) that access the same memory. You have two distinct addresses as well (write_idx; read_idx).
Hence, XST (the synthesizer that ISE uses) inferred a dual-port RAM. Because the depth as well as the element width are small, it inferred a distributed dual-port RAM. Check your synthesis report; it will say:
Found 4x8-bit dual-port RAM <Mram_buf_mem> for signal <buf_mem>.
<snip>
INFO:Xst:3231 - The small RAM <Mram_buf_mem> will be implemented on LUTs in order to maximize performance and save block RAM resources. If you want to force its implementation on block, use option/constraint ram_style.
-----------------------------------------------------------------------
| ram_type | Distributed | |
-----------------------------------------------------------------------
| Port A |
| aspect ratio | 4-word x 8-bit | |
| clkA | connected to signal <write> | rise |
| weA | connected to signal <full> | low |
| addrA | connected to signal <write_idx> | |
| diA | connected to signal <data_in> | |
-----------------------------------------------------------------------
| Port B |
| aspect ratio | 4-word x 8-bit | |
| addrB | connected to signal <read_idx> | |
| doB | connected to internal node | |
-----------------------------------------------------------------------
When you now look at the technology schematic, you will see that XST inferred three instances: Mram_buf_mem1, Mram_buf_mem21 and Mram_buf_mem22 (in my case, anyway; yours might differ).
Mram_buf_mem1 is the input buffer for data_in(5:0); data_in(6) and data_in(7) actually use Mram_buf_mem21 and Mram_buf_mem22, respectively. This is just an artifact of the design not being properly constrained (what is the clock period of read and write? etc.).
So, basically your message above
WARNING:Par:288 - The signal Mram_buf_mem1_RAMD_D1_O has no load. PAR will not attempt to route this signal.
means that one of the output signals provided by the inferred dual-port distributed RAM (D1_O) is not being used (it drives no logic or flip-flops). Therefore, the place-and-route (PAR) step will not even attempt to route it. With all the information gathered above, we can safely assume that this doesn't matter and won't affect your FIFO at all.
However, what will matter is the following: you did nothing to constrain paths between your two clock domains (the read domain and the write domain). This means you might run into issues where write_idx changes while a read is being performed, and vice versa. This can leave you stuck in a state where full is never deasserted or empty is never asserted, because you lack synchronization logic for the signals that need to cross the clock domains.
XST will not insert this logic for you. You can check for these types of errors using the Asynchronous Delay Report and Clock Region Report.
Now, if you're just getting started with the world of FPGAs, you might want to play around a bit with inference of primitives vs. instantiation of primitives. Check the Spartan-6 HDL library guide to see which VHDL language constructs will cause XST to infer, e.g., a RAM, FIFO, or flip-flop, and which constructs will cause it to infer weird and cryptic logic because of unrealistic inferred timing/area constraints.
Finally, try to have synchronous logic as much as possible and properly constrain your design. Also, sorry for the long write-up if you were just looking for an easy two-liner...

WebGL model simplification

I'm currently planning a WebGL game and am starting to make the models for it. I need to know: if, say, my model is at 1X scale and my camera zooms/pans out from the object so that the model becomes 0.1X scale, what kind of simplification does the WebGL engine apply to the models in view?
E.g. using a triangle as an example, here it is at 1X scale.
And here is the triangle at 10% of the original size, while keeping all its complexity (sorry it's so faint).
While the triangle looks the same, all of that complexity isn't necessary, and it could perhaps be simplified down to 4 triangles for performance.
I understand that WebGL is a state machine, so perhaps nothing happens and the complexity of the model remains the same regardless of scale or state; but how do I resolve this for the best possible performance?
At 1X scale there could be only one or very few models in view, but when zoomed out to 0.1X scale there could be many hundreds. This means that if the complexity of each model is too high, performance takes a huge hit and the game becomes unresponsive/useless.
All advice is hugely appreciated.
WebGL doesn't simplify for you. You have to do it yourself.
Generally you compute the distance from the camera and, depending on that distance, display a different hand-made model: far away you display a low-detail model, close up you display a high-detail model. There are lots of ways to do this; which one you choose is up to you. For example:
Use different high poly models close, low poly far away
This is the easiest and most common method. The problem with this method is that you often see popping when the engine switches from the low-poly model to the high-poly model. The three.js sample linked in another answer uses this technique: it creates a LOD object whose job it is to decide which of N models to switch between. It's up to you to supply the models (see the distance-selection sketch at the end of this answer).
Use the low-poly model far away and fade the high-poly one in over it. Once the high-poly one completely obscures the low-poly one, stop drawing the low-poly one.
Grand Theft Auto uses this technique.
Create low poly from high poly and morph between them using any number of techniques.
For example.
0----1----2----3         0--------------3
|    |    |    |         |              |
|    |    |    |         |              |
4----5----6----7         |              |
|    |    |    | <-----> |              |
|    |    |    |         |              |
8----9----10---11        |              |
|    |    |    |         |              |
|    |    |    |         |              |
12---13---14---15        12-------------15
Jak and Daxter and Crash Team Racing (old games) use the structure above. Far away, only the corner points 0, 3, 12 and 15 are used; close up, all 16 points are used. Points 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 13 and 14 can be placed anywhere. Between the far and near distances, all the points are morphed so that the 16-point mesh becomes the 4-point mesh. If you play Jak and Daxter #1 or Ratchet and Clank #1 you can see this morphing going on as you play; by the second version of those games the artists had gotten good at hiding the morphing.
Draw the high-poly model up close; in the distance, render the high-poly model into a texture and draw a billboard. Update the billboard slowly (every N frames instead of every frame). This is a technique used for animated objects. It was used in Crash Team Racing for the other racers when they are far away.
I'm sure there are many others. There are algorithms for tessellating in real time to auto-generate low-poly models from high-poly ones, or for describing your models in some other form (b-splines, metaballs, subdivision surfaces) and then generating some number of polygons. Whether they are fast enough and produce good enough results is up to you; most AAA games, as far as I know, don't use them.
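As a concrete illustration of the first (and most common) technique, here is a hedged sketch of the distance-based selection a LOD object performs. It is written in C++ purely for illustration (three.js's LOD object does the equivalent in JavaScript), and every name in it is made up:

#include <vector>

// One hand-made model per detail level. modelId is whatever handle your
// engine uses for a mesh; HIGH_POLY etc. below are placeholders.
struct LodLevel {
    float minDistance;   // use this model once camera distance >= minDistance
    int   modelId;
};

// levels must be non-empty and sorted by ascending minDistance, e.g.
//   { {0.0f, HIGH_POLY}, {50.0f, MEDIUM_POLY}, {200.0f, LOW_POLY} }
int selectModel(const std::vector<LodLevel>& levels, float cameraDistance)
{
    int chosen = levels.front().modelId;
    for (const LodLevel& level : levels) {
        if (cameraDistance >= level.minDistance)
            chosen = level.modelId;   // keep the coarsest level we qualify for
        else
            break;                    // remaining levels are for larger distances
    }
    return chosen;
}

Each frame you compute the camera-to-object distance, call something like selectModel, and draw whichever model it returns; the popping mentioned above happens exactly at the distances where the returned model changes.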
Search for 'tessellation'.
With it you can add or subtract triangles from your mesh.
Tessellation is closely related to LOD objects (level-of-detail).
The scale factor is just a coefficient that all vertices of the mesh are multiplied by; scaling simply stretches your mesh along the axes.
Take a look at this Three.js example:
http://threejs.org/examples/webgl_lod.html (WASD/mouse to move around)

What is a bank conflict? (Doing Cuda/OpenCL programming)

I have been reading the programming guide for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just sort of dive into how to solve the problem without elaborating on the subject itself. Can anybody help me understand it? I have no preference if the help is in the context of CUDA/OpenCL or just bank conflicts in general in computer science.
For NVIDIA (and AMD, for that matter) GPUs the local memory is divided into memory banks. Each bank can only address one dataset at a time, so if a half-warp tries to load/store data from/to the same bank, the accesses have to be serialized (this is a bank conflict). For GT200 GPUs there are 16 banks (32 banks for Fermi), and 16 or 32 banks for AMD GPUs (57xx or higher: 32, everything below: 16), which are interleaved with a granularity of 32 bits (so bytes 0-3 are in bank 1, bytes 4-7 in bank 2, ..., bytes 64-67 in bank 1 again, and so on). For a better visualization, it basically looks like this:
Bank    |     1       |     2       |     3       | ...
Address |  0  1  2  3 |  4  5  6  7 |  8  9 10 11 | ...
Address | 64 65 66 67 | 68 69 70 71 | 72 73 74 75 | ...
...
So if each thread in a half-warp accesses successive 32-bit values, there are no bank conflicts. The one exception to this rule (that every thread must access its own bank) is broadcasts:
If all threads access the same address, the value is only read once and broadcast to all threads (for GT200 it has to be all threads in the half-warp accessing the same address; IIRC, Fermi and AMD GPUs can do this for any number of threads accessing the same value).
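To illustrate both the rule and the usual padding trick in CUDA shared memory, here is a hedged sketch for a 32-bank device (Fermi or later); the tile size, kernel name, and the launch in the comment are all illustrative:

#define TILE 32

__global__ void bank_demo(const float *in, float *out)
{
    // 32 x 32 tile of floats: element [row][col] lives in bank (col % 32).
    __shared__ float tile[TILE][TILE];
    // Padding with one extra column shifts each row by one bank and removes
    // the column-access conflict below:
    //   __shared__ float tile[TILE][TILE + 1];

    int tx = threadIdx.x, ty = threadIdx.y;

    // Row access: the 32 threads of a warp (consecutive tx, same ty) touch
    // consecutive words, i.e. 32 different banks -> no conflict.
    tile[ty][tx] = in[ty * TILE + tx];
    __syncthreads();

    // Column access: the same 32 threads all touch bank (ty % 32), i.e. one
    // and the same bank -> the accesses are serialized (32-way conflict)
    // unless the padded declaration above is used instead.
    out[ty * TILE + tx] = tile[tx][ty];
}

// Usage sketch, with in/out holding at least TILE*TILE floats on the device:
//   bank_demo<<<1, dim3(TILE, TILE)>>>(d_in, d_out);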
The shared memory that can be accessed in parallel is divided into modules (also called banks). If two memory locations (addresses) fall in the same bank, you get a bank conflict: the accesses are done serially, losing the advantages of parallel access.
In simple words, a bank conflict occurs when a memory access pattern fails to distribute I/O across the banks available in the memory system. The following example elaborates on the concept:
Let us suppose we have a two-dimensional 512x512 array of integers, and our DRAM or memory system has 512 banks in it. By default, the array data will be laid out so that arr[0][0] goes to bank 0, arr[0][1] goes to bank 1, arr[0][2] to bank 2, ..., and arr[0][511] goes to bank 511. To generalize, arr[x][y] occupies bank number y. Now if some code (as shown below) starts accessing the data in column-major fashion, i.e. changing x while keeping y constant, then every consecutive memory access will hit the same bank -- hence a bank conflict.
int arr[512][512];

for ( j = 0; j < 512; j++ )         // outer loop
    for ( i = 0; i < 512; i++ )     // inner loop
        arr[i][j] = 2 * arr[i][j];  // column major processing
Such problems are usually avoided by compilers by buffering the array or using a prime number of elements in the array.
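For instance, a hedged sketch of that fix for the toy example above is to pad the array with one extra column, so that stepping down a column no longer keeps hitting the same bank (the bank of arr[i][j] becomes (i*513 + j) mod 512, which changes with every i):

int arr[512][513];                 // 513 columns instead of 512

for (int j = 0; j < 512; j++)      // outer loop over columns
    for (int i = 0; i < 512; i++)  // inner loop over rows
        arr[i][j] = 2 * arr[i][j]; // consecutive i now land in consecutive banks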
(CUDA Bank Conflict)
I hope this will help.
This is a very good explanation:
http://www.youtube.com/watch?v=CZgM3DEBplE
http://en.wikipedia.org/wiki/Memory_bank
and
http://mprc.pku.cn/mentors/training/ISCAreading/1989/p380-weiss/p380-weiss.pdf
From these links you can find the details about memory banks.
However, the layout described there is a little different from what @Grizzly says.
There, the banks look like this:
bank    |     1      |     2      |     3      |
address | 0, 3, 6... | 1, 4, 7... | 2, 5, 8... |
Hope this helps.