What are the advantages of inline? - actionscript-3

I've been reading a bit about how to use the inline metadata when using the ASC 2.0 compiler.
However, I can't find any source of info explaining why I should use it.
Does anyone know?

Functions induce overhead in any programming language. In ActionScript, when function execution begins, a number of objects and properties are created.
First, an activation object is created that stores the parameters and local variables declared in a function body. It's an internal mechanism that cannot be directly accessed.
Second, a scope chain is created that contains an ordered list of objects that the Flash platform checks for identifier declarations. Every function that executes has a scope chain that is stored in an internal property.
Function closures maintain a snapshot of a function and its lexical environment.
Moving code inline reduces the creation of these objects and the number of references that must be maintained on the stack. According to Adobe, you may see a 4x performance increase.
Of course, there are tradeoffs: inlining induces code complexity, and inlined code increases the amount of bytecode. Besides producing larger applications, the virtual machine spends additional time verifying and JIT compiling the extra bytecode.

To simplify, inline is a sort of compile-time copy/paste of code. Since method calls are expensive and cost execution time, using the inline keyword copies the body of the method to every place the method call appears in your code, so each call is replaced by the method's body. Since this is done at compilation time, it will in theory increase the size of the resulting app (if an inline method is called 10 times, its body is copied and pasted 10 times), but since all calls are replaced you gain execution speed. This is, of course, only relevant for demanding code, like loops running every frame, for example.
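As a rough sketch of that copy/paste idea (written in C++ only for illustration, since the effect is language-neutral; the function names are made up):

// A small helper the compiler is asked to inline.
inline int squareDistance(int dx, int dy) {
    return dx * dx + dy * dy;
}

void updateFrame(int dx, int dy) {
    // As written in source: a call, which would normally set up a stack
    // frame (or, in ActionScript, an activation object and scope chain).
    int viaCall = squareDistance(dx, dy);

    // What the inlined build effectively compiles to: the body is
    // copied to the call site, so no call overhead remains.
    int viaInline = dx * dx + dy * dy;
    (void)viaCall; (void)viaInline;   // silence unused-variable warnings
}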

Related

External FLASH slow verification with STM32cubeProgrammer

I'm working with an STM32F469 chip with a Micron MT25Q Quad-SPI Flash. To program the Flash, an external loader program needs to be developed. That's all working, but the problem is that verification of the QSPI Flash is extremely slow.
Looking in the log file, it shows that the Flash is being programmed in 150K byte blocks. However, the verification is being done in 1K byte blocks. In addition, the chip is re-initialized before each block check. I've tried this both through STM32cubeIDE and in STM32cubeProgrammer directly.
The external loader program includes the correct chip configuration information and specifies a 64K page size. I don't see how to get the programmer to use a larger block size. It looks like it understands which part of the SRAM is used and is using the balance of the 256K of on-board SRAM for programming the QSPI Flash. It could use the same size for reading the data back, or use the Verify() function in the external loader; instead, it's calling Read() and then checking the data itself.
Any thoughts or hints?
Let me add some observations on creating a new external loader. The first observation is "Don't." If you can pick a supported external chip and pin it out to use an existing loader, then do that. STM provides just 4 example programs, but they must have 50 external loaders. If the hardware design copies the schematic of a demo board that has an external loader, you should be fine and can avoid the development work.
The external loader is not a complete executable. It provides a set of functions to do basic operations like Init(), Erase(), Read() and Write(). The trick is that there is no main() and no start-up code is run when the program starts.
The external loader is an ELF file, renamed to "*.stldr". The programming tool looks into the debug information to find the locations of the functions. It then sets the registers to provide the parameters, sets the PC to run the function, and lets it run. There's some super-clever work going on to make this work. The programmer looks at the returned value (R0) to see if things passed or not. It can also figure out if the function has crashed the core or otherwise timed out.
What makes writing the external loader super fun is that the debugger is busy running the program, so there's no debugger available to see what the code is doing. I settled on outputting errors, and encoded information, in the return() values of the called functions to give hints as to what was happening.
The external loader isn't a "full" program. Without the startup code, lots of on-chip stuff isn't set up, and some just isn't going to work. At least I couldn't figure it out; I'm not sure if it wasn't configured right or the debugger was blocking its use. Looking at the example external loaders, they are written in a very simple way and do not call the HAL or use interrupts. You'll need to provide core set-up functions to configure the clock chains. The HAL_Delay() method will never return, as the timers and/or interrupts aren't working. I could never make them work and suspect the NVIC was somehow being disabled. I ended up replacing HAL_Delay() with a for loop that spun based on the core clock rate and the instruction cycles per loop, as sketched below.
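For what it's worth, a minimal sketch of that replacement delay. The cycle count per iteration is an assumption and must be tuned for your compiler, optimization level, and flash wait states:

#include <stdint.h>

#define CORE_CLOCK_HZ   180000000u  /* assumed core clock after clock setup */
#define CYCLES_PER_LOOP 4u          /* rough guess; calibrate against a scope */

static void spin_delay_ms(uint32_t ms)
{
    volatile uint32_t n = (CORE_CLOCK_HZ / 1000u / CYCLES_PER_LOOP) * ms;
    while (n--)
    {
        /* busy-wait; no timers or interrupts required */
    }
}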
The app note suggests developing a stand-alone program to debug the basic capabilities. That's a good idea, but a challenge. Prior to starting the external loader, I had the QSPI doing the needed operations, but from a C++ application calling the HAL. Creating an external loader from that was a long exercise in stripping out and replacing functionality. A hint is that the examples are written at the register level. I'm not good enough to deal directly with the QuadSPI peripheral and the chip's instruction set at the same time.
The normal start-up of a program is eliminated. Everything that's done before main() is called (e.g., in startup_stm32f469nihx.s) is up to you. This includes setting the clock chains to boost the core clock and get the peripheral buses working. The program runs in the on-chip SRAM, so any initialized variables are loaded correctly. There's no data copying needed, but the stack and uninitialized data areas could/should still be zeroed.
I hope this helps someone!
Today I've faced the same issue.
I was able to improve the Verify speed with two simple steps, but Verify is still much slower than programming, which is strange...
If anyone finds a way to change the 1KB block read of STM32CubeProgrammer, I would like to know =).
Here are the changes I made to improve the performance a bit.
Add a kind of lock in the Init function to avoid multiple initializations. This was the most significant change, because I'm checking the Flash ID in my initialization process. Other approaches might be safer, but this simple code snippet worked for me.
int Init(void)
{
    static uint32_t lock;
    if (lock != 0x43213CA5)
    {
        lock = 0x43213CA5;
        /* Init procedure goes here */
    }
    return (1);
}
Cache a page instead of reading the external memory on every call, as sketched below. This helps more if your external memory page reads have a lot of overhead; otherwise, this idea won't give relevant results.
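A minimal sketch of the page-cache idea, assuming the common external-loader Read() signature and a hypothetical QSPI_ReadPage() driver call (both are placeholders for your own loader template and flash driver):

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u                    /* assumed cacheable unit */

static uint8_t  page_cache[PAGE_SIZE];
static uint32_t cached_addr = 0xFFFFFFFFu; /* no page cached yet */

extern int QSPI_ReadPage(uint32_t addr, uint8_t *buf);  /* hypothetical driver */

int Read(uint32_t Address, uint32_t Size, uint8_t *Buffer)
{
    while (Size > 0u)
    {
        uint32_t page   = Address & ~(PAGE_SIZE - 1u);
        uint32_t offset = Address - page;
        uint32_t chunk  = PAGE_SIZE - offset;
        if (chunk > Size)
            chunk = Size;

        if (page != cached_addr)           /* only touch the QSPI on a miss */
        {
            if (QSPI_ReadPage(page, page_cache) != 1)
                return 0;                  /* loader convention: 0 = failure */
            cached_addr = page;
        }
        memcpy(Buffer, &page_cache[offset], chunk);
        Buffer  += chunk;
        Address += chunk;
        Size    -= chunk;
    }
    return 1;                              /* 1 = success */
}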

What is the use of task graphs in CUDA 10?

CUDA 10 added runtime API calls for putting streams (= queues) in "capture mode", so that instead of executing, the submitted work is recorded and returned as a "graph". These graphs can then be made to actually execute, or they can be cloned.
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but in order for the change of parameters to take effect, I had to rebuild the graph executable with 'cudaGraphInstantiate'. This call takes so long that any gain from using graphs is lost (in my case). Setting the parameters only worked for me when I built the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think the call 'cudaGraphGetNodes' on a stream-captured graph would return you the nodes, but the node pointer returned was NULL for me, even though the 'numNodes' variable had the correct number. The documentation explicitly mentions this as a possibility but fails to explain why.
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template: instead of enqueueing the individual nodes before every execution, one changes the parameters of every node before every execution (and perhaps not all nodes actually need their parameters changed).
For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
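For illustration, a hedged sketch of that "graph as a template" pattern using a manually built host node (all names are made up; kernel nodes work the same way via cudaGraphKernelNodeSetParams):

#include <cuda_runtime.h>
#include <cstdio>

void myCallback(void *userData) { std::printf("payload = %d\n", *(int *)userData); }

int main() {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    int payload = 1;
    cudaHostNodeParams params = {};
    params.fn = myCallback;
    params.userData = &payload;

    cudaGraphNode_t node;                  // manual build keeps the node handle
    cudaGraphAddHostNode(&node, graph, nullptr, 0, &params);

    // ...later, before the next launch: retarget the same node.
    int newPayload = 2;
    params.userData = &newPayload;
    cudaGraphHostNodeSetParams(node, &params);

    cudaGraphDestroy(graph);
    return 0;
}

Note that, per the answer above, an already instantiated executable graph may still need re-instantiation with cudaGraphInstantiate for the change to take effect (newer toolkits add cudaGraphExecHostNodeSetParams to update the executable graph directly).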
Another useful feature is concurrent kernel execution. In manual mode, one can add nodes to the graph with dependencies, and the runtime will exploit the available concurrency automatically using multiple streams. The feature itself is not new, but making it automatic is useful for certain applications.
When training a deep learning model, it is common to re-run the same set of kernels in the same order but with updated data. Also, I would expect CUDA to perform optimizations by knowing statically what the next kernels will be. We can imagine that CUDA could fetch more instructions or adapt its scheduling strategy when it knows the whole graph.
CUDA Graphs is trying to solve the problem that, in the presence of too many small kernel invocations, quite some time is spent on the CPU dispatching work for the GPU (overhead).
It allows you to trade resources (time, memory, etc.) to construct a graph of kernels that you can then execute with a single invocation from the CPU instead of doing multiple invocations. If you don't have enough invocations, or your algorithm is different each time, then it won't be worth it to build a graph.
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
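As a concrete sketch of that pattern, here is roughly what stream capture looks like with the CUDA 10 runtime API (jacobiStep, its body, and its launch configuration are invented for illustration):

#include <cuda_runtime.h>

__global__ void jacobiStep(float *out, const float *in, int n) { /* ... */ }

void runIterations(float *a, float *b, int n, int iters) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the per-iteration work once instead of launching it every time.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);  // mode arg is CUDA 10.1+
    jacobiStep<<<(n + 255) / 256, 256, 0, stream>>>(b, a, n);
    jacobiStep<<<(n + 255) / 256, 256, 0, stream>>>(a, b, n);
    cudaStreamEndCapture(stream, &graph);

    // CUDA 10 signature; newer toolkits take a flags argument instead.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    for (int i = 0; i < iters / 2; ++i)
        cudaGraphLaunch(exec, stream);     // one CPU call replays both kernels
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}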
You are not going to see great results if you have an algorithm that you invoke only once, or if your kernels are big; in that case, the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it can be found in the Getting Started with CUDA Graphs post.
Where task-graph-based paradigms shine, though, is when you define your program as tasks with dependencies between them. You give a lot of flexibility to the driver / scheduler / hardware to do the scheduling itself without much fine-tuning on the developer's part. There's a reason we have been spending years exploring the ideas of dataflow programming in HPC.

Are fork...join statements allowed in functions in SystemVerilog?

They are definitely allowed in tasks, but I could not find out whether they are allowed in functions.
Thanks in advance for your help.
Yes, fork...join_none is allowed within functions.
A fork block can only be used in a function if it is matched with a join_none. The reason is that functions must execute in zero time. Because a fork...join_none will be spawned into a separate thread/process, the function can still complete in zero time.
This is clearly stated in IEEE 1800-2012, section 13.4.4 "Background processes spawned by function calls":
Functions shall execute with no delay. Thus, a process calling a function shall return immediately. Statements that do not block shall be allowed inside a function; specifically, nonblocking assignments, event triggers, clocking drives, and fork - join_none constructs shall be allowed inside a function.
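A loose analogy in C++ terms (not SystemVerilog, just to illustrate why join_none is the only legal variant): a zero-time function may start background work, but it may not wait for it.

#include <thread>

void backgroundWork();   // assumed to be defined elsewhere

void zeroTimeFunction() {
    std::thread t(backgroundWork);
    t.detach();          // like fork...join_none: spawn and return immediately
    // t.join() here would be like fork...join: it blocks the caller,
    // which a zero-time function is not allowed to do.
}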
My simulation tool allows fork...join_none in functions but issues a warning that fork...join (and probably fork...join_any) will be converted to begin...end. I couldn't find anything in the standard about this conversion, which is most likely why I don't get a strict compile error.
Be careful, as different simulator vendors may implement different rules. In two of the big 3 simulators, fork...join_none in functions definitely works. fork...join/join_any doesn't make sense in the context of a function, so I would avoid it altogether.

Advantage/Disadvantage of function pointers

So the problem I am having has not actually happened yet. I am planning out some code for a game I am currently working on, and I know that I am going to need to conserve memory as much as possible from step one. My question is: if I have, for example, 500k objects that will need to constantly be constructed and destructed, would it save me any memory to have the functions those classes are going to use be function pointers?
e.g. without function pointers:
class MyClass {
public:
    void Update();
    void Draw();
    ...
};
e.g. with function pointers:
class MyClass {
public:
    void *Update;
    void *Draw;
    ...
};
Would this save me any memory, or would every new instance of MyClass just access the same functions that were defined for the rest of them? If it does save me any memory, would it be enough to be worthwhile?
Assuming those are not virtual functions, you'd use more memory with function pointers.
The first example
There is no per-object allocation for the functions (beyond the base amount required to make new return unique pointers, plus whatever additional implementation you... ellipsized).
Non-virtual member functions are basically static functions taking a this pointer.
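In other words, a call like obj.Update() is conceptually lowered to a call of one shared free function taking a hidden this pointer, roughly (the lowered name is illustrative, not real syntax):

class MyClass {
public:
    void Update();
};

// Roughly what the compiler emits for the non-virtual member function:
void MyClass_Update(MyClass *self);
// obj.Update()  ==>  MyClass_Update(&obj)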
The advantage is that your objects are really simple, and you'll have only one place to look to find the corresponding code.
The disadvantage is that you lose some flexibility, and have to cram all your code into single update/draw functions. This will be hard to manage for anything but a tiny game.
The second example
You allocate two extra pointers per object instance. Pointers are usually 4 to 8 bytes each (depending on your target platform and your compiler).
The advantage is that you gain a lot of flexibility: you're able to change the function you're pointing to at runtime, and you can have a multitude of functions that implement it, thus supporting (but not guaranteeing) better code organization.
The disadvantage is that it will be harder to tell which function each instance will point to when you're debugging your application, or when you're simply reading through the code.
Other options
Using function pointers this specific way (instance data members) usually makes more sense in plain C, and less sense in C++.
If you want those functions to be bound at runtime, you may want to make them virtual instead. The cost is compiler/implementation dependent, but I believe it is (usually) going to be one v-table pointer per object instance.
Plus you can take full advantage of virtual function syntax to dispatch functions based on type, rather than on whatever you happened to bind them to. It will be easier to debug than the function pointer option, since you can simply look at the derived type to figure out which function a particular instance points to.
You also won't have to initialize those function pointers - the C++ type system would do the equivalent initialization automatically (by building the v-table, and initializing each instance's v-table pointer).
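A small sketch comparing the per-instance cost of the three approaches (class and member names are illustrative; exact sizes depend on the platform):

#include <cstdio>

struct NonVirtual {                  // code lives once, nothing per object
    void Update() {}
    void Draw() {}
};

struct WithPointers {                // two function pointers in every object
    void (*Update)(WithPointers *);
    void (*Draw)(WithPointers *);
};

struct WithVTable {                  // one hidden v-table pointer per object
    virtual void Update() {}
    virtual void Draw() {}
    virtual ~WithVTable() = default;
};

int main() {
    std::printf("%zu %zu %zu\n",
                sizeof(NonVirtual), sizeof(WithPointers), sizeof(WithVTable));
    // Typical 64-bit output: 1 16 8
    return 0;
}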
See: http://www.parashift.com/c++-faq-lite/virtual-functions.html

What is the difference between message-passing and method-invocation?

Is there a difference between message-passing and method-invocation, or can they be considered equivalent? This is probably specific to the language; many languages don't support message-passing (though all the ones I can think of support methods) and the ones that do can have entirely different implementations. Also, there are big differences in method-invocation depending on the language (C vs. Java vs Lisp vs your favorite language). I believe this is language-agnostic. What can you do with a passed-method that you can't do with an invoked-method, and vice-versa (in your favorite language)?
Using Objective-C as an example of messages and Java for methods, the major difference is that when you pass messages, the Object decides how it wants to handle that message (usually results in an instance method in the Object being called).
In Java, however, method invocation is a more static thing: you must have a reference to an object of the type you are calling the method on, and a method with the same name and type signature must exist in that type, or the compiler will complain. What is interesting is that the actual call is dynamic, although this is not obvious to the programmer.
For example, consider a class such as
class MyClass {
    void doSomething() {}
}

class AnotherClass {
    void someMethod() {
        Object object = new Object();
        object.doSomething(); // compiler checks and complains that Object contains no such method.
        // However, through an explicit cast, you can calm the compiler down,
        // even though your program will crash at runtime
        ((MyClass) object).doSomething(); // syntactically valid, yet incorrect
    }
}
In Objective-C, however, the compiler simply issues a warning when you pass a message to an object that it thinks may not understand it, and ignoring the warning doesn't stop your program from executing.
While this is very powerful and flexible, it can result in hard-to-find bugs when used incorrectly because of stack corruption.
Adapted from the article here.
Also see this article for more information.
As a first approximation, the answer is: none, as long as you "behave normally".
Even though many people think there is - technically, it is usually the same: a cached lookup of a piece of code to be executed for a particular named-operation (at least for the normal case). Calling the name of the operation a "message" or a "virtual-method" does not make a difference.
BUT: the Actor language is really different: in having active objects (every object has an implicit message queue and a worker thread, at least conceptually), parallel processing becomes easier to handle (google also "communicating sequential processes" for more).
BUT: in Smalltalk, it is possible to wrap objects to make them actor-like, without actually changing the compiler, the syntax or even recompiling.
BUT: in Smalltalk, when you try to send a message which is not understood by the receiver (i.e. "someObject foo:arg"), a message object is created, containing the name and the arguments, and that message object is passed as an argument to the "doesNotUnderstand" message. Thus, an object can decide for itself how to deal with unimplemented message sends (aka calls of an unimplemented method). It can, of course, push them onto a queue for a worker process to sequentialize them...
Of course, this is impossible in statically typed languages (unless you make very heavy use of reflection), but it is actually a VERY useful feature. Proxy objects, code load on demand, remote procedure calls, learning and self-modifying code, adapting and self-optimizing programs, CORBA and DCOM wrappers, and worker queues are all built upon that scheme. It can be misused and lead to runtime bugs, of course.
So it is a double-edged sword. Sharp and powerful, but dangerous in the hands of beginners...
EDIT: I am writing about language implementations here (as in Java vs. Smalltalk), not inter-process mechanisms.
IIRC, they've been formally proven to be equivalent. It doesn't take a whole lot of thinking to at least indicate that they should be. About all it takes is ignoring, for a moment, the direct equivalence of the called address with an actual spot in memory, and consider it simply as a number. From this viewpoint, the number is simply an abstract identifier that uniquely identifies a particular type of functionality you wish to invoke.
Even when you are invoking functions in the same machine, there's no real requirement that the called address directly specify the physical (or even virtual) address of the called function. For example, although almost nobody ever really uses them, Intel protected mode task gates allow a call to be made directly to the task gate itself. In this case, only the segment part of the address is treated as an actual address -- i.e., any call to a task gate segment ends up invoking the same address, regardless of the specified offset. If so desired, the processing code can examine the specified offset, and use it to decide upon an individual method to be invoked -- but the relationship between the specified offset and the address of the invoked function can be entirely arbitrary.
A member function call is simply a type of message passing that provides (or at least facilitates) an optimization under the common circumstance that the client and server of the service in question share a common address space. The 1:1 correspondence between the abstract service identifier and the address at which the provider of that service resides allows a trivial, exceptionally fast mapping from one to the other.
At the same time, make no mistake about it: the fact that something looks like a member function call doesn't prevent it from actually executing on another machine or asynchronously, or (frequently) both. The typical mechanism to accomplish this is a proxy function that translates the "virtual message" of a member function call into a "real message" that can (for example) be transmitted over a network as needed (e.g., Microsoft's DCOM and CORBA both do this quite routinely), as the sketch below illustrates.
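A hedged sketch of that proxy idea (all names are invented; transport() stands in for DCOM/CORBA marshalling or a network send):

#include <iostream>
#include <string>
#include <vector>

struct Message {                       // the "real message"
    std::string method;
    std::vector<int> args;
};

void transport(const Message &m) {     // stand-in for marshalling + send
    std::cout << "dispatching " << m.method << "(" << m.args[0] << ")\n";
}

class CalculatorProxy {                // looks like an ordinary local object
public:
    void add(int x) { transport({"add", {x}}); }   // call becomes a message
};

int main() {
    CalculatorProxy calc;
    calc.add(42);   // reads like a method invocation, travels as a message
    return 0;
}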
They really aren't the same thing in practice. Message passing is a way to transfer data and instructions between two or more parallel processes. Method invocation is a way to call a subroutine. Erlang's concurrency is built on the former concept with its Concurrency Oriented Programming.
Message passing most likely involves a form of method invocation, but method invocation doesn't necessarily involve message passing (if it did, it would be message passing). Message passing is one form of performing synchronization between two parallel processes. Method invocation generally implies synchronous activity: the caller waits for the method to finish before it can continue. Message passing is a form of coroutine; method invocation is a form of subroutine.
All subroutines are coroutines, but not all coroutines are subroutines.
Is there a difference between message-passing and method-invocation, or can they be considered equivalent?
They're similar. Some differences:
Messages can be passed synchronously or asynchronously (e.g. the difference between SendMessage and PostMessage in Windows)
You might send a message without knowing exactly which remote object you're sending it to
The target object might be on a remote machine or O/S.