Learning AVA.js test runner
It is not clear how can I mock global objects (e.g. Date, Math etc.) as long as the tests run in parallel so such object patching becomes concurrent.
How should one really go with it?
Related
CUDA 10 added runtime API calls for putting streams (= queues) in "capture mode", so that instead of executing, they are returned in a "graph". These graphs can then be made to actually execute, or they can be cloned.
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but in order for the change of parameters to take effect, I had to rebuild the graph executable with 'cudaGraphInstantiate'. This call takes so long that any gain of using graphs is lost (in my case). Setting the parameters only worked for me when I build the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes as you do not have the node pointers. You would think the call 'cudaGraphGetNodes' on a stream captured graph would return you the nodes. But the node pointer returned was NULL for me even though the 'numNodes' variable had the correct number. The documentation explicitly mentions this as a possibility but fails to explain why.
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template, so that instead of enqueueing the individual nodes before every execution, one changes the parameters of every node before every execution (and perhaps not all nodes actually need their parameters changed).
For example, See the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
Another useful feature is the concurrent kernel executions. Under manual mode, one can add nodes in the graph with dependencies. It will explore the concurrency automatically using multiple streams. The feature itself is not new but make it automatic becomes useful for certain applications.
When training a deep learning model it happens often to re-run the same set of kernels in the same order but with updated data. Also, I would expect Cuda to do optimizations by knowing statically what will be the next kernels. We can imagine that Cuda can fetch more instructions or adapt its scheduling strategy when knowing the whole graph.
CUDA Graphs is trying to solve the problem that in the presence of too many small kernel invocations, you see quite some time spent on the CPU dispatching work for the GPU (overhead).
It allows you to trade resources (time, memory, etc.) to construct a graph of kernels that you can use a single invocation from the CPU instead of doing multiple invocations. If you don't have enough invocations, or your algorithm is different each time, then it won't worth it to build a graph.
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
You are not going to see great results if you have an algorithm that you invoke once or if your kernels are big; in that case the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it exists in the Getting Started with CUDA Graphs.
Where task graph based paradigms shine though is when you define your program as tasks with dependencies between them. You give a lot of flexibility to the driver / scheduler / hardware to do scheduling itself without much fine-tuning from the developer's part. There's a reason why we have been spending years exploring the ideas of dataflow programming in HPC.
So, I have a theoretical question about the Chisel code transformation.
I already know that the Chisel code is compiled to Java bytecodes, it then runs in the JVM and it emits equivalent Verilog and C++ source codes (for older versions of Chisel).
But I'm having a lot of trouble in understanding that process.
For instance, in the Chisel source code, I can see that there is a Reg class, for example, that creates a definition of a register. I can then import and use this class in the design of the hardware. But I cannot understand where the separation between the description of the Reg class itself and the actual usage of it lies. It's so confusing.
For example, suppose I'm developing a project that USES a Reg object, where there's a source code called whatever.scala, and inside this source code there are Reg objects. As I understand it, the description of the register itself (the Reg.scala) and the source code that uses it (whatever.scala) are all compiled at the same time, and that's precisely the point a cannot get.
To make it short, in my point of view, there is a separation between describing a library, and actually using this library after it was built. You must first compile the library, then you import it into your project and use it. But in Chisel, these two steps seem to happen at the same time.
Is there any intermediate process between the JVM code emission and the creation of the Chisel AST?
Chisel is a high level highly parameterized embedded DSL for generating hardware design.
A chisel program typically consists of several steps:
A chisel3 program first constructs an internal representations of an idealized circuit as an abstact syntax tree (AST). At the end of generation, the AST is serialized in to FIRRTL (an intermediate representation) representation. See: chisel3
The firrtl transformation engine process the high level FIRRTL produced with some number of transformation passes. These passes can optimize the code, do width inferences, and finally emit verilog or low firrtl. See: firrtl
Typically during development the circuit is then unit tested. There are two simple ways to do this.
The verilog emitted can be converted into an executable simulation via verilator and a c++ compiler. The simulation can be executed with a test harness that validates the circuit. See: chisel-testers
Or, the emitted firrtl can be simulated using the firrtl-interpreter a lightweight scala program, capable of running the same unit tests used with the chisel-testers. See: firrtl-interpreter
These steps can be run together, using chisel-tester can execute all the above steps automatically. Or done individually, each step can produce output files for the user to add custom integration or to target the verilog for FPGA or a chip tape-out.
The JVM is simply the execution environment used to run scala programs and is not necessary to understand or interact with in order to build circuits using Chisel.
To address the Chisel vs. your project question:
Chisel is a Scala library that is compiled to JVM bytecode. A project that uses Chisel is a Scala program that links against Chisel. This project is also compiled to JVM bytecode, but includes calls to the separately compiled Chisel library*. This project using Chisel is then executed, running on the JVM. The execution of this program constructs a hardware AST that is ultimately emitted as Verilog.
* Many projects (like rocket-chip) do include the Chisel source code as a subproject. Chisel is usually compiled first and then linked against. However, it should make no difference if it were compiled all at once--it's just Scala code that other Scala code invokes.
I currently have a couple of tests which really run very long. Inside each test I do always the same:
there is a loop which creates a new object (every iteration with different parameters), does some time consuming calculations with the object and at the end of each iteration compares the result to the expected result.
Every iteration in this loop is completely isolated. I could easily run all those 200 very time consuming iterations in parallel. But how best to do this?
Cheers,
AvH
Junit 4 has inbuilt parellel processing. Check this documentation.
Apart from that you may need consider moving all the duplicate iterations in to a static setup method and annotate as #BeforeClass. That will make sure code runs only once in the entire lifecycle.
#BeforeClass
public static void setup() {
//Move anything needs to run only once.
}
You have to create an own modification of the Parameterized runner. See http://jankesterblog.blogspot.de/2011/10/junit4-running-parallel-junit-classes.html
The library JUnit Toolbox provides a ParallelParameterized runner.
This is a homework question, obviously. I'm trying to pipeline a simple, 5 stage (IF,ID,EX,MEM,WB), single-cycle MIPS processor in VHDL. I don't need to implement forwarding or hazard detection for it. I'm just unsure of what components I need to implement.
Is it necessary to create D Flip-Flops for each signal?
The pipeline implementation here uses a for-loop for the outputs - is that something I should do?
Any tips would be much appreciated, I can't seem to find much relevant information on pipelining in VHDL.
What you probably want to do is create a separate entity for each stage of your pipeline and then connect the output of one stage to the input of the other.
To make sure things are pipelined correctly, you just need to make sure that each stage only does whatever processing it needs to do on the rising edge.
If you want an example, take a look at this project of mine. Specifically at the files dft_top.vhd and dft_stage[1-3].vhd. It implements a 16-point 16-bit fixed point DFT in pipelined stages.
I need to test a program that first preprocesses some data and then computes several different results using this preprocessed data -- it makes sense to write separate test for each computation.
Official JUnit policy seems to be that I should run preprocessing before each computation test.
How can I set up my test so that I could run the preparation only once (it's quite slow) before running remaining tests?
Use the annotation #BeforeClass to annotate the method which will be run once before all test methods.