Here is my best guess, but it doesn't look like the generated behavioral Verilog will result in a simple transparent latch when synthesized:
// DXP latch (attempt)
val dxp   = config(2) & config(0)            // latch-enable select
val latch = Reg(next = lut.io.out)           // edge-triggered register, not a transparent latch
val out   = Mux(dxp, latch, lut.io.out)      // hold the registered value while dxp is set
I appreciate your ideas on this.
Chisel does not support latches. Reg() will only generate edge-triggered state elements.
If you really want latches, you would have to modify the backend of Chisel to understand a new Latch() construct and generate the appropriate Verilog. However, this will take you down a long rabbit hole of difficulties, the first of which is you would probably be throwing away the synchronous, edge-triggered timing model (that allows things like the C++ emulator to work).
In our experience, critical applications that need some of the properties of latches (such as time borrowing) end up being handled automatically by the synthesis tools.
Related
I'm finding that when generating Verilog output from the Chisel framework, all of the 'structure' defined in the Chisel code is lost at the interface.
This is problematic for instantiating this work in larger SystemVerilog designs.
Are there any extensions or features in Chisel to support this better? For example, automatically converting Chisel "Bundle" objects into SystemVerilog 'struct' ports.
Or creating SV enums, when the Chisel code is written using the Enum class.
Currently, no. However, both suggestions sound like very good candidates for discussion for future implementation in Chisel/FIRRTL.
SystemVerilog Struct Generation
Most Chisel code instantiated inside Verilog/SystemVerilog will use some interface wrapper that deals with converting the necessary signal names that the instantiator wants to use into Chisel-friendly names. As one example of doing this see AcceleratorWrapper. That instantiates a specific accelerator and does the connections to the Verilog names the instantiator expects. You can't currently do this with SystemVerilog structs, but you could accomplish the same thing with a SystemVerilog wrapper that maps the SystemVerilog structs to deterministic Chisel names. This is the same type of problem/solution that most people encounter/solve when integrating external IP in their project.
Kludges aside, what you're talking about is possible in the future...
Some explanation is necessary as to why this is complex:
Chisel is converted to FIRRTL. FIRRTL is then lowered to a reduced subset of FIRRTL called "low" FIRRTL. Low FIRRTL is then mapped to Verilog. Part of this lowering process flattens all bundles using uniquely determined names (typically a.b.c will lower to a_b_c but will be uniquified if a namespace conflict due to the lowering would result). Verilog has no support for structs, so this has to happen. Additionally, and more critically, some optimizations happen at the Low FIRRTL level like Constant Propagation and Dead Code Elimination that are easier to write and handle there.
However, SystemVerilog or some other language that a FIRRTL backend is targeting that supports non-flat types benefits from using the features of that language to produce more human-readable output. There are two general approaches for rectifying this:
1. Lowered types retain information about how they were originally constructed via annotations, and the SystemVerilog emitter reconstructs them. This seems inelegant due to lowering and then un-lowering.
2. The SystemVerilog emitter uses a different sequence of FIRRTL transforms that does not go all the way to Low FIRRTL. This would require some of the optimizing transforms run on Low FIRRTL to be rewritten to work on higher forms. This is tractable, but hard.
If you want some more information on what passes are run during each compiler phase, take a look at LoweringCompilers.scala
Enumerated Types
What you mention for Enum is planned for the Verilog backend. The idea here was to have Enums emit annotations describing what they are. The Verilog emitter would then generate localparams. The preliminary work for annotation generation was added as part of StrongEnum (chisel3#885/chisel3#892), but the annotations portion had to be later backed out. A solution to this is actively being worked on. A subsequent PR to FIRRTL will then augment the Verilog emitter to use these. So, look for this going forward.
On Contributions and Outreach
For questions like this with (currently) negative answers, feel free to file an issue on the respective Chisel3 or FIRRTL repository. And even better than that is an RFC followed by an implementation.
Note: The question has been updated to address the questions that have been raised in the comments, and to emphasize that the core of the question is about the interdependencies between the Runtime- and Driver API
The CUDA runtime libraries (like CUBLAS or CUFFT) generally use the concept of a "handle" that summarizes the state and context of such a library. The usage pattern is quite simple:
// Create a handle
cublasHandle_t handle;
cublasCreate(&handle);
// Call some functions, always passing in the handle as the first argument
cublasSscal(handle, ...);
// When done, destroy the handle
cublasDestroy(handle);
However, there are many subtle details about how these handles interoperate with Driver- and Runtime contexts and multiple threads and devices. The documentation lists several, scattered details about context handling:
The general description of contexts in the CUDA Programming Guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context
The handling of multiple contexts, as described in the CUDA Best Practices Guide at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
The context management differences between runtime and driver API, explained at http://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html
The general description of CUBLAS contexts/handles at http://docs.nvidia.com/cuda/cublas/index.html#cublas-context and their thread safety at http://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
However, some of the information seems to be not entirely up to date (for example, I think one should use cuCtxSetCurrent instead of cuCtxPushCurrent and cuCtxPopCurrent?), some of it seems to be from a time before the "Primary Context" handling was exposed via the driver API, and some parts are oversimplified in that they only show the simplest usage patterns, make only vague or incomplete statements about multithreading, or cannot be applied to the concept of "handles" that is used in the runtime libraries.
My goal is to implement a runtime library that offers its own "handle" type, and that allows usage patterns that are equivalent to the other runtime libraries in terms of context handling and thread safety.
For the case that the library can internally be implemented solely using the Runtime API, things may be clear: context management is solely the responsibility of the user. If users create their own driver contexts, the rules stated in the documentation about Runtime and Driver context management apply. Otherwise, the Runtime API functions will take care of handling the primary contexts.
However, there may be the case that a library internally has to use the Driver API - for example, in order to load PTX files as CUmodule objects and obtain CUfunction objects from them. And when the library should - for the user - behave like a Runtime library, but internally has to use the Driver API, some questions arise about how the context handling has to be implemented "under the hood".
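(As an aside, to make concrete what "internally has to use the Driver API" means: loading a PTX module and looking up a kernel boils down to roughly the following sketch; the PTX string and kernel name are placeholders, and error checks are omitted.)
CUmodule module;
cuModuleLoadData(&module, ptxSourceString);        // ptxSourceString: NUL-terminated PTX text (placeholder)
CUfunction kernel;
cuModuleGetFunction(&kernel, module, "myKernel");  // "myKernel" is a placeholder kernel name
// ... later, when the handle is destroyed:
cuModuleUnload(module);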
What I have figured out so far is sketched here.
(It is "pseudocode" in that it omits the error checks and other details, and ... all this is supposed to be implemented in Java, but that should not be relevant here)
1. The "Handle" is basically a class/struct containing the following information:
class Handle
{
CUcontext context;
boolean usingPrimaryContext;
CUdevice device;
}
2. When it is created, two cases have to be covered: It can be created when a driver context is current for the calling thread. In this case, it should use this context. Otherwise, it should use the primary context of the current (runtime) device:
Handle createHandle()
{
cuInit(0);
// Obtain the current context
CUcontext context;
cuCtxGetCurrent(&context);
CUdevice device;
// If there is no context, use the primary context
boolean usingPrimaryContext = false;
if (context == nullptr)
{
usingPrimaryContext = true;
// Obtain the device that is currently selected via the runtime API
int deviceIndex;
cudaGetDevice(&deviceIndex);
// Obtain the device and its primary context
cuDeviceGet(&device, deviceIndex);
cuDevicePrimaryCtxRetain(&context, device);
cuCtxSetCurrent(context);
}
else
{
cuCtxGetDevice(&device);
}
// Create the actual handle. This might internally allocate
// memory or do other things that are specific for the context
// for which the handle is created
Handle handle = new Handle(device, context, usingPrimaryContext);
return handle;
}
3. When invoking a kernel of the library, the context of the associated handle is made current for the calling thread:
void someLibraryFunction(Handle handle)
{
cuCtxSetCurrent(handle.context);
callMyKernel(...);
}
Here, one could argue that the caller is responsible for making sure that the required context is current. But if the handle was created for a primary context, then this context will be made current automatically.
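(callMyKernel is elided above; purely for illustration, a Driver API launch would look roughly like the following hypothetical sketch, where the kernel arguments and launch dimensions are placeholders.)
void callMyKernel(CUfunction kernel, CUdeviceptr data, int n, CUstream stream)
{
    void *params[] = { &data, &n };        // kernel arguments, passed by address
    unsigned int blockDim = 256;
    unsigned int gridDim = (n + blockDim - 1) / blockDim;
    cuLaunchKernel(kernel,
                   gridDim, 1, 1,          // grid dimensions
                   blockDim, 1, 1,         // block dimensions
                   0,                      // dynamic shared memory, in bytes
                   stream,                 // stream to launch on
                   params, nullptr);       // arguments; no "extra" options
}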
4. When the handle is destroyed, this means that cuDevicePrimaryCtxRelease has to be called, but only when the context is a primary context:
void destroyHandle(Handle handle)
{
if (handle.usingPrimaryContext)
{
cuDevicePrimaryCtxRelease(handle.device);
}
}
From my experiments so far, this seems to expose the same behavior as a CUBLAS handle, for example. But my possibilities for thoroughly testing this are limited, because I only have a single device, and thus cannot test the crucial cases, e.g. of having two contexts, one for each of two devices.
So my questions are:
Are there any established patterns for implementing such a "Handle"?
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
(I also had a look at the context handling in tensorflow, but I'm not sure whether one can derive recommendations about how to implement handles for a runtime library from that...)
(An "Update" has been removed here, because it was added in response to the comments, and should no longer be relevant)
I'm sorry I hadn't noticed this question sooner - as we might have collaborated on this somewhat. Also, it's not quite clear to me whether this question belongs here, on codereview.SX or on programmers.SX, but let's ignore all that.
I have now done what you were aiming to do, and possibly more generally. So, I can offer both an example of what to do with "handles", and moreover, suggest the prospect of not having to implement this at all.
The library is an expansion of cuda-api-wrappers to also cover the Driver API and NVRTC; it is not yet release-grade, but it is in the testing phase, on this branch.
Now, to answer your concrete question:
Pattern for writing a class surrounding a raw "handle"
Are there any established patterns for implementing such a "Handle"?
Yes. If you read:
What is the difference between: Handle, Pointer and Reference
you'll notice a handle is defined as an "opaque reference to an object". It has some similarity to a pointer. A relevant pattern, therefore, is a variation on the PIMPL idiom: In regular PIMPL, you write an implementation class, and the outwards-facing class only holds a pointer to the implementation class and forwards method calls to it. When you have an opaque handle to an opaque object in some third-party library or driver - you use the handle to forward method calls to that implementation.
That means that your outwards-facing class is not a handle; it represents the object to which you have a handle.
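Purely as an illustration (not code from my library), a minimal sketch of that pattern using a CUmodule as the opaque handle; the class name is made up and error handling is omitted:
// The class is not the handle - it represents the module; the raw CUmodule
// is only used to forward calls to the driver.
class Module {
public:
    explicit Module(const char *ptxSource)            // ptxSource: NUL-terminated PTX text
    {
        cuModuleLoadData(&handle_, ptxSource);
    }
    ~Module() { cuModuleUnload(handle_); }

    Module(const Module&) = delete;                    // owning wrapper, so no copies
    Module& operator=(const Module&) = delete;

    CUfunction getFunction(const char *name) const     // forward the call through the handle
    {
        CUfunction f;
        cuModuleGetFunction(&f, handle_, name);
        return f;
    }

private:
    CUmodule handle_;                                   // the opaque handle owned by this object
};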
Generality and flexibility
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
I'm not sure what exactly CUBLAS does under the hood (and I have almost never used CUBLAS to be honest), but if it were well-designed and implemented, it would
create its own context, and try not to impinge on the rest of your code, i.e. it would always do:
Push our CUBLAS context onto the top of the stack
Do actual work
Pop the top of the context stack.
Your class doesn't do this.
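A minimal sketch of that discipline, applied to the question's someLibraryFunction (the actual work is elided):
void someLibraryFunction(Handle handle)
{
    cuCtxPushCurrent(handle.context);   // make the library's context current on this thread
    // ... do the actual work (kernel launches, memory operations, ...) ...
    CUcontext popped;
    cuCtxPopCurrent(&popped);           // pop it again; whatever the caller had current is restored
}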
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Yes:
Use RAII whenever it is possible and relevant. If your creation code allocates a resource (e.g. via the CUDA driver), the destructor for the object you return should safely release those resources (see the sketch after this list).
Allow for both reference-type and value-type use of handles, i.e. it may be a handle I created, but it may also be a handle I got from somewhere else and that isn't my responsibility to release. This is trivial if you leave it up to the user to release resources, but a bit tricky if you take that responsibility.
You assume that if there's any current context, that's the one your handle needs to use. Says who? At the very least, let the user pass a context in if they want to.
Avoid writing the low-level parts of this on your own unless you really must. You are quite likely to miss some things (the push-and-pop is not the only thing you might be missing), and you're repeating a lot of work that is actually generic and not specific to your application or library. I may be biased here, but you can now use nice, RAII-ish wrappers for CUDA contexts, streams, modules, devices etc. without even knowing about raw handles for anything.
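To make the RAII and owning-vs-non-owning points concrete, here is a minimal hand-rolled sketch (this is not the cuda-api-wrappers API; the class name is illustrative and error handling is omitted):
// Either owns (retains/releases) a device's primary context, or merely refers
// to a context someone else owns; the destructor releases only in the first case.
class ContextHolder {
public:
    static ContextHolder retainPrimary(CUdevice device)    // owning use
    {
        CUcontext ctx;
        cuDevicePrimaryCtxRetain(&ctx, device);
        return ContextHolder(ctx, device, /*owning=*/true);
    }
    static ContextHolder wrap(CUcontext ctx)                // non-owning (reference-type) use
    {
        return ContextHolder(ctx, /*device=*/0, /*owning=*/false);  // device only matters when owning
    }
    ~ContextHolder()
    {
        if (owning_) cuDevicePrimaryCtxRelease(device_);    // release only what we retained
    }

    ContextHolder(const ContextHolder&) = delete;           // no copies: avoid double release
    ContextHolder& operator=(const ContextHolder&) = delete;
    ContextHolder(ContextHolder&& other)                    // ownership can be moved
        : context_(other.context_), device_(other.device_), owning_(other.owning_)
    {
        other.owning_ = false;
    }

    CUcontext get() const { return context_; }

private:
    ContextHolder(CUcontext c, CUdevice d, bool o) : context_(c), device_(d), owning_(o) {}
    CUcontext context_;
    CUdevice  device_;
    bool      owning_;
};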
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
To the best of my knowledge, NVIDIA hasn't released it.
Suppose I want to write a function that tries to find a key in a map and returns None if it cannot: try_find : 'a -> ('a, 'b) Map.t -> 'b option. What is the canonical way to do this? To first check that the key exists with mem and then call find? Or to catch the Not_found exception? Batteries seems to do the latter.
On the other hand, in languages like C# or Java people are usually discouraged from using exceptions in such cases, for performance reasons. Is using exceptions on "normal" execution paths a usual thing in OCaml, or is it also discouraged?
OCaml exceptions are as fast as function calls for the default backend. For JavaScript backends, this is not always true. The canonical OCaml way to implement a function that doesn't throw an exception is to use a throwing function and translate the exception into a nullary variant, e.g.,
let try_find x xs = try Some (List.find x xs) with Not_found -> None
Calling mem and then find loses performance, as you will actually traverse the list twice.
There are tradeoffs between raising an exception and returning an option type. The standard function List.find will not allocate any new values on the heap, so no garbage will be created. On the other hand, the try_find function will allocate a new value every time something is found (None is a constant, so it is not allocated). This creates extra work for the garbage collector, which will eventually degrade performance. To me, the semantic benefits of total functions outweigh the possible performance degradation. If the latter does matter (in tight loops), I can always optimize locally, either by using an exception in a very tight context, or by using continuation-passing style and/or a GADT.
Is using exceptions on "normal" execution paths a usual thing in OCaml or is it also discouraged?
It wasn't discouraged by the design of the language, and the OCaml standard library uses exceptions a lot. However, the language evolves, and new features are added to it. Moreover, new backends are implemented, like several JavaScript backends and the Java and .NET backends, and it is not trivial to provide the same performance guarantees for them. So, over time, the popularity of exceptions has declined, and many people have started to favor total functions with explicitly encoded errors, cf. the result type newly added to the standard library. Another example is Jane Street's Core library (and all their other libraries), which disfavors exceptions and uses them only for exceptional cases.
You should decide on an exception policy for yourself (or borrow an existing one). My personal policy is to avoid them in public interfaces and to use them sparingly and very locally. I also use exceptions for logic and programmer errors - basically, for errors that shouldn't be caught.
From what I've seen, OCaml exceptions are quite efficient, and I see them used more often than in other functional languages, I guess.
I try to avoid them myself as they interfere with reasoning about the program. But a self-contained use in a library doesn't seem so bad.
The efficiency of low-level things like exceptions is something that might vary a lot from platform to platform. I suspect that catching the Not_found exception would be faster for very large maps, as it avoids traversing the map twice. Otherwise it might not matter much.
This is a homework question, obviously. I'm trying to pipeline a simple, 5 stage (IF,ID,EX,MEM,WB), single-cycle MIPS processor in VHDL. I don't need to implement forwarding or hazard detection for it. I'm just unsure of what components I need to implement.
Is it necessary to create D Flip-Flops for each signal?
The pipeline implementation here uses a for-loop for the outputs - is that something I should do?
Any tips would be much appreciated, I can't seem to find much relevant information on pipelining in VHDL.
What you probably want to do is create a separate entity for each stage of your pipeline and then connect the output of one stage to the input of the next.
To make sure things are pipelined correctly, you just need to make sure that each stage only does whatever processing it needs to do on the rising edge.
If you want an example, take a look at this project of mine. Specifically at the files dft_top.vhd and dft_stage[1-3].vhd. It implements a 16-point 16-bit fixed point DFT in pipelined stages.
I'm very curious: is there any advantage to building a SWF using bytecode rather than regular ActionScript?
From what I've read there are some ways to speed code up a little, but are there things that AS3 code cannot do?
EDIT:
Please do not focus on coding style, problems, type checks, and syntax. Let's say I'll have an external SWF with one class written in bytecode, and I'd like to know what benefits I can get from lower-level coding.
The Flash compiler doesn't use all the opcodes available in the Flash player so, in theory, you can indeed get some performance increase by "manually" writing opcodes.
For example, the Haxe compiler can make use of the Alchemy OpCodes and provide a performance boost compared to Flash:
Access Alchemy OpCodes
There are OpCodes for memory allocation hidden in the SWF player which are used by Adobe Alchemy. Haxe has the ability to access them giving you low level memory access which can allow HUGE speed increases.
http://haxe.org/doc/why
I don't know how safe it is to use these opcodes though. Since they are not supported by the official Flash compiler, they might be dropped in a future release of the Flash Player.
Just adding to what Laurent posted. ASC is the ActionScript compiler used by Adobe products so far (MXML compilation uses code from ASC, and the compiler in Flash CS has a customized build of it, but neither greatly changes the actual compiler; they act as the front end / "glue" to the linker and other utilities).
ASC is not an optimizing compiler. That means it doesn't do any of the optimizations usually possible when compiling to a lower-level language. It analyzes the generated bytecode only enough to be reasonably certain it is not erroneous. (It is still possible for the compiler to generate erroneous code from valid AS3 code.)
There isn't a one-to-one correspondence between all valid code you can write in bytecode and AS3, which means that AS3 limits you to a subset of what is possible in bytecode. It is entirely possible that, looking at the resulting bytecode, you would see how it is easier to get a certain value while it is on the stack, but there is no tool in AS3 to get it from there. For example, you could avoid creating a loop iterator by relying on the first register containing the iterator, if you know that the rest of the code inside the loop never reads or writes that register. Obviously there are many more "features" you can discover when you analyze the way bytecode is processed.
But it is important to understand that by optimizing locally you will hardly achieve anything significant unless your goal is fine-grained and very specific, such as a particular cryptographic algorithm or string-parsing routine. Optimizing larger pieces of a program by hand is really difficult - so difficult, in fact, that you will need a tool at least to verify that you are actually optimizing. In the end, you will find yourself using this tool to generate the optimized variants of code and test them - and this is how compilers are built :)