This is a homework question, obviously. I'm trying to pipeline a simple, 5-stage (IF, ID, EX, MEM, WB), single-cycle MIPS processor in VHDL. I don't need to implement forwarding or hazard detection for it. I'm just unsure of what components I need to implement.
Is it necessary to create D Flip-Flops for each signal?
The pipeline implementation here uses a for-loop for the outputs - is that something I should do?
Any tips would be much appreciated; I can't seem to find much relevant information on pipelining in VHDL.
What you probably want to do is create a separate entity for each stage of your pipeline and then connect the output of one stage to the input of the next.
To make sure things are pipelined correctly, you just need to make sure that each stage registers its outputs on the rising edge of the clock and does its combinational work between edges.
If you want an example, take a look at this project of mine. Specifically at the files dft_top.vhd and dft_stage[1-3].vhd. It implements a 16-point 16-bit fixed point DFT in pipelined stages.
I'm working with an STM32F469 chip with a Micron MT25Q Quad-SPI Flash. To program the Flash, an external loader program has to be developed. That's all working, but the problem is that verification of the QSPI Flash is extremely slow.
Looking in the log file, it shows that the Flash is being programmed in 150K-byte blocks. However, the verification is being done in 1K-byte blocks. In addition, the chip is re-initialized before each block check. I've tried this both through STM32CubeIDE and in STM32CubeProgrammer directly.
The external loader program includes the correct chip configuration information and specifies a 64K page size. I don't see how to get the programmer to use a larger block size. It looks like it understands what part of the SRAM is used and is using the balance of the 256K of on-board SRAM for programming the QSPI Flash. It could use the same size for reading the data back, or it could use the Verify() function in the external loader; instead it's calling Read() and then checking the data itself.
Any thoughts or hints?
Let me add some observations on creating a new external loader. The first observation is "Don't." If you can pick a supported external chip and pin it out to use an existing loader, then do that. ST provides just 4 example programs, but they must have 50 external loaders. If your hardware design copies the schematic of a demo board that has an external loader, you should be fine and can avoid the development work.
The external loader is not a complete executable. It provides a set of functions to do basic operations like Init(), Erase(), Read() and Write(). The trick is that there is no main() and no start-up code is run when the program starts.
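For reference, the function set looks roughly like this (paraphrased from ST's example Loader_Src.c templates; treat the exact names and signatures as an approximation and check your template version):

    /* Entry points an STM32 external loader exposes; paraphrased from ST's
       example Loader_Src.c -- verify against your template version. Each
       returns 1 on success and 0 on failure unless noted. */
    #include <stdint.h>

    int Init(void);                                          /* clocks, GPIO, QSPI bring-up */
    int Write(uint32_t addr, uint32_t size, uint8_t *buf);   /* program one block           */
    int SectorErase(uint32_t startAddr, uint32_t endAddr);   /* erase the covered sectors   */
    int MassErase(void);                                     /* optional                    */
    int Read(uint32_t addr, uint32_t size, uint8_t *buf);    /* optional                    */
    uint64_t Verify(uint32_t flashAddr, uint32_t ramBufAddr,
                    uint32_t size, uint32_t misalignment);   /* optional                    */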
The external loader is an ELF file, renamed to "*.stldr". The programming tool looks into the debug information to find the locations of the functions. It then sets registers to pass the parameters, sets the PC to the function's entry point, and lets it run. There's some super-clever work going on to make this work. The programmer looks at the returned value (R0) to see if things passed or not. It can also figure out if the function has crashed the core or otherwise timed out.
What makes writing the external loader super fun is that the programming tool itself is running the program, so there's no debugger available to see what the code is doing. I settled on outputting errors, and encoded information, in the return values of the called functions to give hints as to what was happening.
The external loader isn't a "full" program. Without the startup code, lots of on-chip stuff isn't set up, and some of it just isn't going to work; at least I couldn't figure it out. I'm not sure if it wasn't configured right or if the debugger was blocking its use. Looking at the example external loaders, they are written in a very simple way and do not call the HAL or use interrupts. You'll need to provide core setup functions to configure the clock chains. The HAL_Delay() method will never return, as the timers and/or interrupts aren't working; I could never make them work and suspect the NVIC was somehow being disabled. I ended up replacing HAL_Delay() with a for loop that spun based on the core clock rate and the instruction cycles per loop.
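For what it's worth, a minimal sketch of that spin loop (CYCLES_PER_LOOP is a guess you have to calibrate for your core, compiler, and optimization level):

    #include <stdint.h>

    extern uint32_t SystemCoreClock;    /* CMSIS global; must be set by your clock setup */
    #define CYCLES_PER_LOOP 4u          /* assumption -- measure for your part/compiler  */

    /* Crude HAL_Delay() stand-in: no SysTick, no interrupts, just burns cycles. */
    static void spin_delay_ms(uint32_t ms)
    {
        volatile uint32_t n = (SystemCoreClock / 1000u / CYCLES_PER_LOOP) * ms;
        while (n--)
            ;                           /* volatile counter keeps the loop alive */
    }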
The app note suggests developing a stand-alone program to debug the basic capabilities. That's a good idea, but a challenge. Prior to starting the external loader, I had the QSPI doing the needed operations, but from a C++ application calling the HAL. Creating an external loader from that was a long exercise in stripping out and replacing functionality. A hint: the examples are written at the register level. I'm not good enough to deal directly with the QuadSPI peripheral and the chip's instruction set at the same time.
The normal start-up of a program is eliminated. Everything that's done before main() is called (e.g., in startup_stm32f469nihx.s) is up to you. This includes setting up the clock chains to boost the core clock and get the peripheral buses working. The program runs in the on-chip SRAM, so any initialized variables are loaded correctly. There's no data copying needed, but the stack and uninitialized data areas could/should still be zeroed.
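A sketch of what that amounts to inside Init() (the BSS symbol names follow a typical GNU linker script and SystemClock_Config() stands in for your board's clock setup; both are assumptions):

    #include <stdint.h>

    extern uint32_t __bss_start__, __bss_end__;  /* names from a typical GNU linker script */
    void SystemClock_Config(void);               /* your clock-chain setup */

    int Init(void)
    {
        /* Zero the uninitialized-data section, as startup_stm32f469nihx.s would. */
        for (uint32_t *p = &__bss_start__; p < &__bss_end__; ++p)
            *p = 0;

        SystemClock_Config();   /* boost the core clock, enable peripheral buses */
        /* ...QSPI bring-up goes here... */
        return 1;
    }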
I hope this helps someone!
Today I faced the same issue.
I was able to improve the Verify speed with two simple steps, but Verify is still much slower than programming, and this is strange...
If anyone finds a way to change the 1KB block read of STM32CubeProgrammer, I would like to know =).
Here are the changes I made to improve the performance a bit.
Add a kind of lock in the Init function to avoid multiple initializations. This was the most significant change, because I'm checking the Flash ID in my initialization process. Other approaches might be safer, but this simple code snippet worked for me.
    #include <stdint.h>

    int Init(void)
    {
        /* 'lock' lives in SRAM and survives between calls, so after the
           first successful pass the expensive Flash-ID check is skipped,
           even though the programmer calls Init() before every block. */
        static uint32_t lock;

        if (lock != 0x43213CA5)
        {
            lock = 0x43213CA5;
            /* Init procedure goes here */
        }
        return (1);
    }
Cache a page instead of reading the external memory on each call. This helps most if each page read from your external memory has a lot of fixed overhead; otherwise it won't give relevant results.
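A sketch of the idea, assuming a 4 KB page and a hypothetical low-level QSPI_ReadPage() helper (both placeholders for your actual part and driver):

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096u                     /* assumption -- match your part */

    static uint8_t  page_buf[PAGE_SIZE];
    static uint32_t page_addr = 0xFFFFFFFFu;    /* address of the cached page */

    int Read(uint32_t addr, uint32_t size, uint8_t *buf)
    {
        while (size > 0) {
            uint32_t base = addr & ~(PAGE_SIZE - 1u);
            if (base != page_addr) {            /* miss: fetch the whole page once */
                if (!QSPI_ReadPage(base, page_buf, PAGE_SIZE))  /* hypothetical driver call */
                    return 0;
                page_addr = base;
            }
            uint32_t off   = addr - base;
            uint32_t chunk = PAGE_SIZE - off;
            if (chunk > size) chunk = size;
            memcpy(buf, page_buf + off, chunk); /* serve from the cache */
            addr += chunk; buf += chunk; size -= chunk;
        }
        return 1;
    }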
CUDA 10 added runtime API calls for putting streams (= queues) in "capture mode", so that instead of executing, the work enqueued on them is recorded into a "graph". These graphs can then be instantiated and actually executed, or they can be cloned.
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but in order for the change of parameters to take effect, I had to rebuild the graph executable with 'cudaGraphInstantiate'. This call takes so long that any gain from using graphs is lost (in my case). Setting the parameters only worked for me when I built the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think that calling 'cudaGraphGetNodes' on a stream-captured graph would return the nodes, but the node pointer returned was NULL for me even though the 'numNodes' variable had the correct number. The documentation explicitly mentions this as a possibility but fails to explain why.
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template: instead of enqueueing the individual nodes before every execution, one changes the parameters of each node before every execution (and perhaps not all nodes actually need their parameters changed).
For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
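A minimal sketch of that template pattern with a host node, building the graph manually so the node handle stays available (error checking omitted; the re-instantiation per launch reflects the caveat about cost raised above):

    #include <cuda_runtime.h>
    #include <stdio.h>

    static void CUDART_CB host_fn(void *userData) {
        printf("payload = %d\n", *(int *)userData);
    }

    int main(void) {
        cudaGraph_t graph;
        cudaGraphNode_t node;
        cudaStream_t stream;
        int payloads[3] = { 10, 20, 30 };

        cudaStreamCreate(&stream);
        cudaGraphCreate(&graph, 0);

        cudaHostNodeParams p = { host_fn, &payloads[0] };
        cudaGraphAddHostNode(&node, graph, NULL, 0, &p);  /* manual build keeps the node handle */

        for (int i = 0; i < 3; ++i) {
            p.userData = &payloads[i];
            cudaGraphHostNodeSetParams(node, &p);         /* retarget the template */

            cudaGraphExec_t exec;                         /* re-instantiate so the change takes effect */
            cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);
            cudaGraphLaunch(exec, stream);
            cudaStreamSynchronize(stream);
            cudaGraphExecDestroy(exec);
        }
        return 0;
    }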
Another useful feature is concurrent kernel execution. In manual mode, one can add nodes to the graph with explicit dependencies, and the runtime will exploit the available concurrency automatically using multiple streams. The feature itself is not new, but making it automatic is useful for certain applications.
When training a deep learning model, it often happens that the same set of kernels is re-run in the same order but with updated data. Also, I would expect CUDA to do optimizations by knowing statically what the next kernels will be. One can imagine that CUDA could fetch more instructions or adapt its scheduling strategy when it knows the whole graph.
CUDA Graphs tries to solve the problem that, in the presence of many small kernel invocations, you see quite some time spent on the CPU dispatching work for the GPU (overhead).
It allows you to spend resources (time, memory, etc.) up front to construct a graph of kernels that you can then run with a single invocation from the CPU instead of many. If you don't have enough invocations, or if your algorithm is different each time, then building a graph won't be worth it.
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something), and it's pretty prominent in a lot of applications that are great fits for GPUs (e.g., think of the Jacobi method).
You are not going to see great results if you have an algorithm that you invoke once, or if your kernels are big; in that case the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it is given in Getting Started with CUDA Graphs.
Where task-graph-based paradigms shine, though, is when you define your program as tasks with dependencies between them. You give a lot of flexibility to the driver / scheduler / hardware to do the scheduling itself without much fine-tuning on the developer's part. There's a reason we have spent years exploring the ideas of dataflow programming in HPC.
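As a rough sketch of the capture-once, launch-many pattern (memsets stand in for real kernels to keep it self-contained):

    #include <cuda_runtime.h>

    int main(void) {
        float *d_buf;
        cudaStream_t stream;
        cudaGraph_t graph;
        cudaGraphExec_t exec;

        cudaStreamCreate(&stream);
        cudaMalloc((void **)&d_buf, 1 << 20);

        /* record the work instead of executing it */
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        cudaMemsetAsync(d_buf, 0, 1 << 20, stream);   /* stand-ins for the real kernels */
        cudaMemsetAsync(d_buf, 1, 1 << 20, stream);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);

        /* one CPU dispatch per iteration instead of one per kernel */
        for (int i = 0; i < 1000; ++i)
            cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);
        return 0;
    }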
I'm very curious: is there any advantage to building a SWF using bytecode rather than regular ActionScript?
As I've read, there are some ways to speed up code a little, but are there some things that AS3 code cannot do?
EDIT:
Please do not focus on coding style, problems, type checks, and syntax. Let's say I'll have an external SWF with one class written in bytecode; I'd like to know what benefits I can get from lower-level coding.
The Flash compiler doesn't use all the opcodes available in the Flash player so, in theory, you can indeed get some performance increase by "manually" writing opcodes.
For example, the Haxe compiler can make use of the Alchemy OpCodes and provide a boost of performance compared to Flash:
Access Alchemy OpCodes
There are OpCodes for memory allocation hidden in the SWF player which are used by Adobe Alchemy. Haxe has the ability to access them giving you low level memory access which can allow HUGE speed increases.
http://haxe.org/doc/why
I don't know how safe it is to use these opcodes though. Since they are not supported by the official Flash compiler, they might be dropped in a future release of the Flash Player.
Just adding to what Laurent posted. ASC is the ActionScript compiler used by Adobe products so far (the MXML compiler reuses code from ASC, and the compiler in Flash CS is a customized build of it; neither greatly changes the actual compiler, acting instead as the front end / "glue" to the linker and other utilities).
ASC is not an optimizing compiler. This means it doesn't do the optimizations usually possible when compiling to a lower-level language. It analyzes the generated bytecode only enough to be mostly certain it isn't erroneous. (It is still possible for the compiler to generate erroneous bytecode from valid AS3 code.)
There isn't a one-to-one correspondence between all valid bytecode and AS3, which means that AS3 limits you to a subset of what is possible in bytecode. It is absolutely possible that in the resulting bytecode you can see an easier way to get a certain value while it is on the stack, but there is no tool in AS3 to get it from there. For example, you could avoid creating a loop iterator by relying on the first register containing the iterator, if you know that the rest of the code inside the loop never reads or writes the first register. Obviously there are a lot more "features" you can discover when you analyze the way bytecode is processed.
But it is important to understand that by optimizing locally you will hardly achieve anything significant unless your goal is fine-grained and very specific, such as a particular cryptographic algorithm or a string-parsing routine. Optimizing larger pieces of a program by hand is really difficult. In fact, so difficult that you will need a tool to at least verify that you are actually optimizing. In the end, you will find yourself using this tool to generate the optimized variants of code and test them, and this is how compilers are built :)
So, I think I have a very weird question.
So, let's say that I already have a program on my GPU, and in that program I call a function X. But that function X is not defined yet.
I want to be able to modify that function X dynamically, by completely changing the code and putting it into the program without recompiling the rest or losing any pointers whatsoever.
To compare it with something most of us know, I want to be able to do what shaders do in OpenGL: in the middle of execution, I can change the code of one shader, recompile only that shader, activate the program, and now it uses the new one.
So, is this possible? Or do I need to recompile the whole thing every time? And if I have to recompile, do I lose the various arrays that I created in global memory?
Thanks
W
If you compile with the -cuda flag using nvcc, you can get the intermediate C++ source that streams PTX to the processor. In theory, you could post-process this intermediate output to generate PTX on the fly and send it over. You might even be able to make the PTX self-modifying, but that's way out of my league.
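For the "send it over" part, the driver API can JIT-load freshly generated PTX at runtime, and allocations made in the same context survive module loads and unloads, which speaks to the global-memory question. A sketch, assuming the PTX text defines an entry point named "X" that takes a single pointer argument:

    /* Sketch: JIT-load runtime-generated PTX with the driver API. Assumes
       cuInit() has been called and a context is current; device allocations
       made in that context stay valid across module loads/unloads. The entry
       name "X" and its argument list are assumptions. */
    #include <cuda.h>

    void run_dynamic_X(const char *ptx_text, CUdeviceptr d_data)
    {
        CUmodule   mod;
        CUfunction fn;
        void *args[] = { &d_data };

        cuModuleLoadData(&mod, ptx_text);       /* JIT-compile the new PTX      */
        cuModuleGetFunction(&fn, mod, "X");     /* look up the replacement      */
        cuLaunchKernel(fn, 1, 1, 1, 256, 1, 1,  /* grid of 1, block of 256      */
                       0, 0, args, NULL);
        cuCtxSynchronize();
        cuModuleUnload(mod);                    /* drop it when a new X arrives */
    }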
I've recently started a project, building a physics engine.
I was hoping you could give me some advice related to some documentation and/or best technologies for this.
First of all, I've seen that Game Physics Engine Development is highly recommended for the task at hand, and I was wondering if you could give me a second opinion. Should I get it?
Also, while browsing Amazon, I've stumbled onto Game Engine Architecture, and since I want to build my physics engine for games, I thought this might be a good read as well.
Second, I know that simulating physics is highly computation-intensive, so I would like to use either CUDA or OpenCL. Right now I'm leaning towards OpenCL, because it would work on both NVIDIA and ATI chipsets. What do you guys suggest?
PS: I will be implementing this in C++ on Linux.
Thanks.
I would suggest first of all planning a simple game as a test case for your engine. Having a basic game will drive feature and API development. Writing an engine without a clear goal makes the project riskier. While I agree that NVIDIA and ATI should be treated as separate targets for performance reasons, I'd recommend you start with neither.
I personally wrote the physics engine for Uncharted: Drake's Fortune, a PS3 game. I did a pass in C++, and when it worked, made a pass to optimize it for VMX, and then put it on the SPUs. Mind you, I did just a fraction of what I wanted to do initially because of time constraints. After that I made an iteration to split the data stages out and formulate a pipeline of data transformations.
This matters because, whether on CPU, GPU, or SPU, modern processors running nontrivial code spend most of their time waiting for caches. You have to pay special attention to data structures and pipeline them so that you have a small working set of data at any stage. For example, I do broadphase first, so I don't need the shapes, just world-space bounding boxes. So I split the bounding boxes into a separate array and compute them all together in another pass that writes them out in an optimal way. As input to the bbox computation, I need the shape transformations and some bounds from them, but not necessarily the whole shapes. After broadphase, I compute/update simulation islands, at the same time performing the narrow phase, for which I do actually need the shapes. And so on; I described this with pictures in an article I wrote for Game Physics Pearls.
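To make the data-layout point concrete, here is a toy version of such a bounding-box pass in C (all names invented for illustration); broadphase then scans only the packed box array, never the full shapes:

    #include <math.h>

    typedef struct { float min[3], max[3]; } AABB;

    typedef struct {
        float rot[9];      /* 3x3 world rotation, row-major */
        float pos[3];      /* world translation             */
        float half[3];     /* local-space half extents      */
        /* ...full collision geometry lives elsewhere; broadphase never loads it */
    } ShapeXform;

    /* Pass 1: compute every world-space AABB in one tight loop, writing a
       dense array with a small working set. The box is the conservative
       bound of the oriented box: project the half extents onto each axis. */
    void update_aabbs(const ShapeXform *xf, AABB *boxes, int n)
    {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < 3; ++j) {
                float r = fabsf(xf[i].rot[3*j+0]) * xf[i].half[0]
                        + fabsf(xf[i].rot[3*j+1]) * xf[i].half[1]
                        + fabsf(xf[i].rot[3*j+2]) * xf[i].half[2];
                boxes[i].min[j] = xf[i].pos[j] - r;
                boxes[i].max[j] = xf[i].pos[j] + r;
            }
        }
    }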
I guess what I'm trying to say comes down to the following points:
Make sure you have a clear goal that drives your development; a very basic game with a fleshed-out design would be best in the case of a game physics engine.
Don't try to optimize before you have a working product. Write it in the simplest and fastest way possible first, and fix all the bugs in the math. Design it so that you can port it to CUDA later, but don't start writing CUDA kernels before you have boxes rolling on the screen.
After you write the first pass in C++, optimize it for the CPU: streamline it so that it doesn't thrash the cache, and compartmentalize the code so that there's no spaghetti of calls and all the code from each stage is localized. This will help a) port to CUDA, b) port to OpenCL, c) port to a console, d) make it run reasonably fast, and e) make it possible to debug.
While developing, resist the temptation to go off and do something you just thought about if that feature is not necessary for your clear goal (see #1). That's why you need a goal: to steer you towards it and make it possible to finish the actual project. Distractions usually kill projects without clear goals.
Remember that in one way or another, software development is iterative. It's OK to do a rough-in and then refine it. Lather, rinse, repeat; it's the mantra of a programmer :)
It's easy to give advice. If you wanna do something, just go and do it, and we'll sit back and critique :)
Here is an answer regarding the choice of CUDA or OpenCL. I do not have a recommendation for a book.
If you want to run your program on both NVIDIA and ATI chipsets, then OpenCL will make the job easier. However, you will want to write a different version of each kernel to get good performance on each chipset. For example, on ATI cards you'll want to manually vectorize code using float4/int4 data types (or accept a nearly 4x performance penalty), while NVIDIA works better with scalar data types.
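To illustrate with a hypothetical pair of OpenCL kernels (the float4 version assumes the buffer length is a multiple of 4):

    /* Scalar version: fine on NVIDIA hardware. */
    __kernel void saxpy_scalar(__global const float *x, __global float *y, float a)
    {
        size_t i = get_global_id(0);
        y[i] += a * x[i];
    }

    /* Manually vectorized version for ATI parts: each work-item handles four
       elements via float4. Assumes the buffer length is a multiple of 4. */
    __kernel void saxpy_vec4(__global const float4 *x, __global float4 *y, float a)
    {
        size_t i = get_global_id(0);
        y[i] += a * x[i];
    }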
If you're only targeting NVIDIA, then CUDA is somewhat more convenient to program in.