Pipeline - MEM (memory) and IF (instruction fetch) - MIPS

In a pipeline, are MEM (memory) and IF (instruction fetch) the same hardware element?
If they are the same memory, then two instructions can't load or store in the same clock cycle, am I right?
[MIPS processor diagram]

Are MEM (memory) and IF (instruction fetch) the same hardware element?
No, they are not, because a) otherwise they would not be drawn as separate blocks, and b) code loads (i.e. fetches) are not the same as data loads. A code fetch retrieves the instruction that says what to do with data (the function), while loads and stores obtain or save the arguments of that function.
If they are the same memory, then two instructions can't load or store in the same clock cycle, am I right?
Both load and store are done in the MEM stage, not in IF. Because there is only one MEM block on the diagram, at most one data-memory operation can be performed per clock. This does not mean that the IF stage is necessarily blocked by MEM: whether the instruction and data memories are separate, or there is a separate instruction cache, determines whether IF can proceed in parallel with MEM, but that is outside the scope of the diagram you showed.
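To make that concrete, here is a tiny cycle-level sketch in C++ (a toy model with invented names, not anything taken from the diagram) of the structural hazard that a single shared memory would create, versus split instruction/data memories:

    #include <cstdio>

    // Toy model: a unified memory has one port per cycle shared by IF and MEM;
    // split instruction/data memories (or an I-cache) give IF its own port.
    struct Cycle {
        bool mem_stage_accesses_memory;  // a load or store is in MEM this cycle
        bool split_i_d_memories;         // Harvard-style split or separate I-cache
    };

    // Returns true if the IF stage must stall this cycle.
    bool if_stage_stalls(const Cycle& c) {
        if (c.split_i_d_memories)
            return false;                    // IF has its own port: no conflict
        return c.mem_stage_accesses_memory;  // unified memory: MEM wins, IF waits
    }

    int main() {
        Cycle unified{true, false};
        Cycle split{true, true};
        printf("unified memory, load in MEM -> IF stalls? %d\n", if_stage_stalls(unified));
        printf("split I/D memories, load in MEM -> IF stalls? %d\n", if_stage_stalls(split));
    }
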

Are there some practical examples that illustrate pipelining in datapath and control?

Pipelining the datapath essentially means dividing (conceptually cutting up) the hardware resources into stages. But does pipelining the control mean that each resource at a pipelined stage gets its own control signals?
For instance, in most RISC architectures we have 5 pipeline stages, so does the MEM pipe stage get its own control signals for load or store?
Are there some practical examples of control pipelining?
In a classic 5-stage pipeline, each stage of the pipe has inputs that come from the previous stage (except the first one, of course), and each stage of the pipe has outputs that go to the next stage (except the last one, of course).  It stands to reason that these inputs & outputs are comprised of both data and control signals.
The EX stage needs to know what ALU operation to perform (control: ALUOp) and the ALU input operands (data).
The MEM stage needs to know whether to read memory (control: MemRead) or to write memory (control: MemWrite) (plus size & type for extension, usually glossed over) and where to read (data: Address) and what to write (data: Write Data).
The WB stage needs to know whether to write a register (control: RegWrite) and what register to write (data: Write Register) and what value to write to the register (data: Write Data).
In the single-cycle processor, all these control signals are generated by lookup (using the opcode) in the ID stage.  When the processor is pipelined, either those signals are forwarded from one stage to another, or else each stage would have to repeat the lookup using the opcode (the opcode would then need to be forwarded from one stage to another so that each stage could repeat the lookup, though it is possible that the opcode is forwarded anyway, perhaps for exceptions).  (I believe that repeating the lookup in each stage would incur costs (time & hardware) compared with forwarding control signals, especially for WB, which is supposed to execute in the first half of a cycle.)
Because the WB stage needs to know whether to write a register, that information (control: RegWrite) must be passed to it from the MEM stage, which gets it from the EX stage, which gets it from the ID stage, where it is generated by lookup of the opcode.  EX & MEM don't use the RegWrite control signal, but must accept it as an input so as to pass it through as output to the next stage.
Similar is true for control signals needed by MEM: MemRead and MemWrite, which are generated in ID, passed from EX to MEM (not used in EX), and MEM need not pass these further, since WB also doesn't use those signals.
If you look in chapter 4 of Computer Organization and Design RISC-V edition, towards the end of the chapter (Fig 4.44 in the 1st edition), it shows the control signals output from one stage passing through stage pipeline registers and into the next intermediate stage. For example, Instruction [30, 14-12] is fed into ID/EX and then read by ALU Control in the EX stage. That is an example of pipelining a control signal.
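To illustrate, here is a hedged sketch (plain C++ structs; the field names are my own, not taken from the book) of how the control bits produced in ID ride along in the pipeline registers next to the data, each stage consuming its own bits and merely forwarding the rest:

    #include <cstdint>

    struct IDEX {                        // ID/EX pipeline register
        uint8_t  alu_op;                 // control consumed in EX
        bool     mem_read, mem_write;    // control only forwarded, for MEM
        bool     reg_write;              // control only forwarded, for WB
        uint32_t operand_a, operand_b;   // data for the ALU
        uint8_t  dest_reg;
    };

    struct EXMEM {                       // EX/MEM pipeline register
        bool     mem_read, mem_write;    // consumed in MEM
        bool     reg_write;              // still just passing through
        uint32_t alu_result;             // memory address, or value headed to WB
        uint32_t store_data;             // data for a store
        uint8_t  dest_reg;
    };

    struct MEMWB {                       // MEM/WB pipeline register
        bool     reg_write;              // finally consumed in WB
        uint32_t write_data;             // load result or ALU result
        uint8_t  dest_reg;               // which register to write
    };
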

Cache Implementation in Pipelined Processor

I have recently started coding in Verilog. I have completed my first project, prototyping a MIPS 32 processor with 5-stage pipelining. Now my next task is to implement a single-level cache hierarchy on the instruction memory.
I have successfully implemented a 2-way set-associative cache.
Previously I had declared the instruction memory as an array of registers, so whenever I need to access the next instruction in the IF stage, the data (instruction) is delivered to the register immediately for further decoding (since a blocking/non-blocking assignment from any memory location completes within the cycle).
But now that I have a single-level cache added on top of it, it takes a few extra cycles for the cache FSM to do its work (tag lookup, and the replacement policy in case of a cache miss). The maximum delay is about 5 cycles when there is a cache miss.
Since my pipeline advances to the next stage every cycle, whenever there is a cache miss the cache fails to deliver the instruction before the pipeline moves on, so the output is always wrong.
To counteract this, I have made the cache clock 5 times faster than the processor's pipeline clock. This does the job: since the cache clock is much faster, it always finishes before the processor clock ticks.
But is this workaround legit?? I mean, I haven't heard of multiple clocks in a processor system. How do real-world processors overcome this issue?
Yes, of course, there is another way: using stall cycles in the pipeline until the data is available from the cache (a hit). But I am just wondering whether making the memory system faster by increasing its clock is justified??
P.S. I am a newbie to computer architecture and Verilog. I don't know much about VLSI. This is my first question ever, because whatever question comes up, I usually find the answer readily available on the web, but I can't find much detail about this problem, so I am here.
I also asked my professor; she told me to research the topic further, because none of my colleagues/seniors have worked much on pipelined processors.
But is this workaround legit??
No, it isn't :P You're not only increasing the cache clock, but also apparently the memory clock. And if you can run your cache 5x faster and still make the timing constraints, that means you should clock your whole CPU 5x faster if you're aiming for max performance.
A classic 5-stage RISC pipeline assumes and is designed around single-cycle latency for cache hits (and simultaneous data and instruction cache access), but stalls on cache misses. (Data load/store address calculation happens in EX, and cache access in MEM, which is why that stage exists)
A stall is logically equivalent to inserting a NOP, so you can do that on cache miss. The program counter needs to not increment, but otherwise it should be a pretty local change.
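A minimal sketch of that "hold the PC, inject a NOP" behaviour (plain C++; the signal and struct names are invented, not from any particular design) could look like this:

    #include <cstdint>

    struct IFIDReg {          // IF/ID pipeline register
        uint32_t instruction;
        uint32_t pc;
    };

    const uint32_t NOP = 0x00000000;   // MIPS sll $0,$0,0 encodes as all zeros

    void fetch_cycle(bool icache_hit, uint32_t fetched, uint32_t& pc, IFIDReg& if_id) {
        if (icache_hit) {
            if_id.instruction = fetched;   // pass the real instruction down the pipe
            if_id.pc = pc;
            pc += 4;                       // normal sequential fetch
        } else {
            if_id.instruction = NOP;       // bubble: the rest of the pipeline sees a NOP
            // pc is left unchanged, so the same fetch is retried next cycle
        }
    }
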
If you had hardware performance counters, you'd maybe want to distinguish between real instructions vs. fake stall NOPs so you could count real instructions executed.
You'll need to implement pipeline interlocks for other stages that stall to wait for their inputs to be ready, e.g. a cache-miss load followed by an add that uses the result.
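The usual load-use check (sketched here in C++ with invented names, following the textbook formulation) compares the destination register of a load sitting in EX against the source registers of the instruction being decoded, and stalls one cycle on a match:

    // Stall if the instruction in EX is a load whose destination register
    // is a source operand of the instruction currently in ID.
    bool load_use_stall(bool idex_mem_read, int idex_dest_reg,
                        int ifid_rs, int ifid_rt) {
        return idex_mem_read &&
               (idex_dest_reg == ifid_rs || idex_dest_reg == ifid_rt);
    }
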
MIPS I had load-delay slots (you can't use the result of a load in the following instruction, because the MEM stage is after EX). So that ISA rule hides the 1 cycle latency of a cache hit without requiring the HW to detect the dependency and stall for it.
But a cache miss still had to be detected. Probably it stalled the whole pipeline whether there was a dependency or not. (Again, like inserting a NOP for the rest of the pipeline while holding on to the incoming instruction. Except this isn't the first stage, so it has to signal to the previous stage that it's stalling.)
Later versions of MIPS removed the load delay slot to avoid bloating code with NOPs when compilers couldn't fill the slot. Simple HW then had to detect the dependency and stall if needed, but smarter hardware probably tracked loads anyway so they could do hit under miss and so on. Not stalling the pipeline until an instruction actually tried to read a load result that wasn't ready.
MIPS = "Microprocessor without Interlocked Pipeline Stages" (i.e. no data-hazard detection). But it still had to stall for cache misses.
An alternate expansion for the acronym (which still fits MIPS II, where the load delay slot was removed, requiring HW interlocks to detect that data hazard) would be "Minimally Interlocked Pipeline Stages", but apparently I made that up in my head; thanks @PaulClayton for catching that.

How do the nodes in a CUDA graph connect?

CUDA graphs are a new way to synthesize complex operations from multiple operations. With "stream capture", it appears that you can run a mix of operations, including cuBLAS and similar library operations, and capture them as a single "meta-kernel".
What's unclear to me is how the data flow works for these graphs. In the capture phase, I allocate memory A for the input, memory B for the temporary values, and memory C for the output. But when I capture this in a graph, I don't capture the memory allocations. So when I then instantiate multiple copies of these graphs, they cannot share the input memory A, temporary workspace B or output memory C.
How then does this work? I.e. when I call cudaGraphLaunch, I don't see a way to provide input parameters. My captured graph basically starts with a cudaMemcpyHostToDevice, how does the graph know which host memory to copy and where to put it?
Background: I found that CUDA is heavily bottlenecked on kernel launches; my AVX2 code was 13x slower when ported to CUDA. The kernels themselves seem fine (according to NSight); it's just the overhead of scheduling several hundred thousand kernel launches.
A memory allocation would typically be done outside of a graph definition/instantiation or "capture".
However, graphs provide for "memory copy" nodes, where you would typically do cudaMemcpy type operations.
At the time of graph definition, you pass a set of arguments for each graph node (which will depend on the node type, e.g. arguments for the cudaMemcpy operation, if it is a memory copy node, or kernel arguments if it is a kernel node). These arguments determine the actual memory allocations that will be used when that graph is executed.
If you wanted to use a different set of allocations, one method would be to instantiate another graph with different arguments for the nodes where there are changes. This could be done by repeating the entire process, or by starting with an existing graph, making changes to node arguments, and then instantiating a graph with those changes.
Currently, in CUDA graphs, it is not possible to perform runtime binding (i.e. at the point of graph "launch") of node arguments to a particular graph/node. It's possible that new features may be introduced in future releases, of course.
Note that there is a CUDA sample code called simpleCudaGraphs available in CUDA 10 which demonstrates the use of both memory copy nodes, and kernel nodes, and also how to create dependencies (effectively execution dependencies) between nodes.
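As an illustration only (a minimal sketch, not the referenced sample; myKernel, the sizes, and the lack of error checking are placeholders), this is roughly how the pointers used during capture become the fixed node arguments that every later cudaGraphLaunch reuses:

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data, int n) {       // placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *hostBuf = nullptr, *devBuf = nullptr;
        cudaMallocHost((void**)&hostBuf, n * sizeof(float));  // allocations live outside the graph
        cudaMalloc((void**)&devBuf, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Capture: the specific pointers used here become the node arguments.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);   // CUDA 10.1+ form
        cudaMemcpyAsync(devBuf, hostBuf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        myKernel<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
        cudaMemcpyAsync(hostBuf, devBuf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t graphExec;
        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // CUDA 10/11 signature

        // Every launch replays the captured work on the SAME hostBuf/devBuf;
        // there is no way to pass different pointers at launch time.
        cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        cudaFree(devBuf);
        cudaFreeHost(hostBuf);
    }
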

Global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene);
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you say concurrent reads may or may not cripple the degree of parallelism, because that is what it means to be parallel: each thread reads concurrently. Instead you should be asking whether performance suffers from extra memory accesses when each thread basically wants the same thing, i.e. the same node.
According to the article here, memory accesses can be coalesced only within a warp, and only if data locality is present.
That means that if threads within a warp try to access memory locations near each other, those accesses can be coalesced. In your case each thread is trying to access the "same" node until it reaches an endpoint where the threads branch.
This means the memory accesses will be coalesced within the warp until the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access for devices with compute capability less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device of compute capability > 2.0. That being said, pretty much regardless of your device generation, when accessing global memory with large strides the effective bandwidth becomes poor (Reference). I would assume that for random access the same behavior is likely.
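As a small illustration (a sketch with made-up kernel names, not taken from the question's code), the difference between a coalesced and a strided access pattern looks like this:

    // Coalesced: consecutive threads in a warp read consecutive elements,
    // so the warp's loads collapse into a few wide memory transactions.
    __global__ void coalescedRead(const float* __restrict__ in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads touch elements `stride` apart, so each
    // load may need its own transaction and effective bandwidth drops.
    __global__ void stridedRead(const float* __restrict__ in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * (long long)stride) % n];
    }
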
If you are changing the structure while it is being read, which I assume you might be (if it's a scene, you probably update it each frame?), then yes, that cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem; atomic operations guarantee that such races don't corrupt the data.
You can try stuffing the 'scene' into shared memory if you can fit it (see the sketch after this answer).
You can also try using streams to increase concurrency which also brings some sort of synchronization to the kernels that are run in the same stream.
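For the shared-memory suggestion, here is a hedged sketch assuming the scene (or at least its hot top levels) is small enough to fit; Node, MAX_SHARED_NODES, and trace_ray are invented for illustration, and the kernel signature is modified (an extra numNodes parameter) relative to the question:

    struct Node { int left, right; float bounds[6]; };   // hypothetical scene node

    #define MAX_SHARED_NODES 256    // must fit within the SM's shared-memory budget

    // Placeholder traversal: real code would walk the structure from nodes[0].
    __device__ float trace_ray(const Node* nodes, int pixel) {
        return nodes[0].bounds[0] + pixel * 0.0f;
    }

    __global__ void trace(int width, int height, float* frameBuffer,
                          const Node* scene, int numNodes) {
        __shared__ Node cache[MAX_SHARED_NODES];

        // Cooperatively copy the (small) scene into shared memory, once per block.
        for (int i = threadIdx.x; i < numNodes && i < MAX_SHARED_NODES; i += blockDim.x)
            cache[i] = scene[i];
        __syncthreads();

        int pixel = blockIdx.x * blockDim.x + threadIdx.x;
        if (pixel < width * height)
            frameBuffer[pixel] = trace_ray(cache, pixel);  // traversal reads hit shared memory
    }
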

CUDA optimization for Maxwell

I am trying to understand the Parallel Forall post on instruction-level profiling, and especially the following lines in the section Reducing Memory Dependency Stalls:
NVIDIA GPUs do not have indexed register files, so if a stack array is accessed with dynamic indices, the compiler must allocate the array in local memory. In the Maxwell architecture, local memory stores are not cached in L1 and hence the latency of local memory loads after stores is significant.
I understand what register files are, but what does it mean that they are not indexed? And why does that prevent the compiler from keeping a stack array accessed with dynamic indices in registers?
The quote says that the array will be stored in local memory. What block does this local memory correspond to in the architecture below?
... what does it mean that they are not indexed
It means that indirect addressing of registers is not supported. So it isn't possible to index from one register (theoretically the register holding the first element of an array) to another arbitrary register. As a result the compiler can't generate code for non static indexing of an array stored in registers.
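A small illustration (my own toy kernels, not from the post): with compile-time indices the compiler can typically promote the array to registers after unrolling, but a runtime index normally forces it into local memory:

    // Indices known at compile time: after full unrolling, buf[0..3] can live
    // in registers, because every access names a specific element.
    __global__ void staticIndex(float* out) {
        float buf[4];
        #pragma unroll
        for (int i = 0; i < 4; ++i) buf[i] = i * 2.0f;
        out[threadIdx.x] = buf[0] + buf[3];
    }

    // Index depends on a runtime value: there is no instruction that can select
    // "register number k" at runtime, so the compiler places buf in local memory.
    __global__ void dynamicIndex(float* out, int k) {
        float buf[4];
        #pragma unroll
        for (int i = 0; i < 4; ++i) buf[i] = i * 2.0f;
        out[threadIdx.x] = buf[k & 3];   // dynamic index -> local memory spill
    }
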
What block does this local memory correspond to in the architecture below?
It doesn't correspond to any of them. Local memory is stored in off-chip device DRAM, not in any of the on-chip blocks shown in that diagram.