What kind of optimizations AVM2 supports? - actionscript-3

I wonder, what kind of optimizations AVM2 (ActionScript 3 VM) support? I know it uses JIT but does it support Dead Code Elimination, constant folding, inlining, etc.
Also it's very interesting to me that ActionScript compiler also do some optimizations. AFAIK C# compiler does very small set of optimizations (only required for language support), JIT does all the work. And it works very fast.
Thanks.
Thanks to MPD. AVM2 supports:
Constant Folding
Copy & Constant Propagation
Common Subexpression Elimination (CSE)
Dead Code Elimination (DCE)

Take a look at these slides: ActionScript 3.0 and AVM2: Performance Tuning.

I don't think that the Flash/Flex compiler do most of this optimizations, but you can achieve this results with 3rd party softwares, like secureSWF (commercial).
Maybe you can find another tool that is free or Open Source that does this too.

Related

Normal Cuda Vs CuBLAS?

Just of curiosity. CuBLAS is a library for basic matrix computations. But these computations, in general, can also be written in normal Cuda code easily, without using CuBLAS. So what is the major difference between the CuBLAS library and your own Cuda program for the matrix computations?
We highly recommend developers use cuBLAS (or cuFFT, cuRAND, cuSPARSE, thrust, NPP) when suitable for many reasons:
We validate correctness across every supported hardware platform, including those which we know are coming up but which maybe haven't been released yet. For complex routines, it is entirely possible to have bugs which show up on one architecture (or even one chip) but not on others. This can even happen with changes to the compiler, the runtime, etc.
We test our libraries for performance regressions across the same wide range of platforms.
We can fix bugs in our code if you find them. Hard for us to do this with your code :)
We are always looking for which reusable and useful bits of functionality can be pulled into a library - this saves you a ton of development time, and makes your code easier to read by coding to a higher level API.
Honestly, at this point, I can probably count on one hand the number of developers out there who actually implement their own dense linear algebra routines rather than calling cuBLAS. It's a good exercise when you're learning CUDA, but for production code it's usually best to use a library.
(Disclosure: I run the CUDA Library team)
There's several reasons you'd chose to use a library instead of writing your own implementation. Three, off the top of my head:
You don't have to write it. Why do work when somebody else has done it for you?
It will be optimised. NVIDIA supported libraries such as cuBLAS are likely to be optimised for all current GPU generations, and later releases will be optimised for later generations. While most BLAS operations may seem fairly simple to implement, to get peak performance you have to optimise for hardware (this is not unique to GPUs). A simple implementation of SGEMM, for example, may be many times slower than an optimised version.
They tend to work. There's probably less chance you'll run up against a bug in a library then you'll create a bug in your own implementation which bites you when you change some parameter or other in the future.
The above isn't just relevent to cuBLAS: if you have a method that's in a well supported library you'll probably save a lot of time and gain a lot of performance using it relative to using your own implementation.

Slatec + CUDA Fortran

I have code written in old-style Fortran 95 for combustion modelling. One of the features of this problem is that one have to solve stiff ODE system for taking into account chemical reactions influence. For this purpouse I use Fortran SLATEC library, which is also quite old. The solving procedure is straight forward, one just need to call subroutine ddriv3 in every cell of computational domain, so that looks something like that:
do i = 1,Number_of_cells ! Number of cells is about 2000
call ddriv3(...) ! All calls are independent on cell number i
end do
ddriv3 is quite complex and utilizes many other library functions.
Is there any way to get an advantage with CUDA Fortran, without searching some another library for this purpose? If I just run this as "parallel loop" is that will be efficient, or may be there is another way?
I'm sorry for such kind of question that immidiately arises the most obvious answer: "Why wouldn't you try and know it by yourself?", but i'm in a really straitened time conditions. I have no any experience in CUDA and I just want to choose the most right and easiest way to start.
Thanks in advance !
You won't be able to use or parallelize the ddriv3 call without some effort. Your usage of the phrase "parallel loop" suggests to me you may be thinking of using OpenACC directives with Fortran, as opposed to CUDA Fortran, but the general answer isn't any different in either case.
The ddriv3 call, being part of a Fortran library (which is presumably compiled for x86 usage) cannot be directly used in either CUDA Fortran (i.e. using CUDA GPU kernels within Fortran) or in OpenACC Fortran, for essentially the same reason: The library code is x86 code and cannot be used on the GPU.
Since presumably you may have access to the source implementation of ddriv3, you might be able to extract the source code, and work on creating a CUDA version of it (or a version that OpenACC won't choke on), but if it uses many other library routines, it may mean that you have to create CUDA (or direct Fortran source, for OpenACC) versions of each of those library calls as well. If you have no experience with CUDA, this might not be what you want to do (I don't know.) If you go down this path, it would certainly imply learning more about CUDA, or at least converting the library calls to direct Fortran source (for an OpenACC version).
For the above reasons, it might make sense to investigate whether a GPU library replacement (or something similar) might exist for the ddriv3 call (but you specifically excluded that option in your question.) There are certainly GPU libraries that can assist in solving ODE's.

Is there any advantage of building bytecode than regular actionscript?

Im very curious , is there any advantage of building SWF using bytecode than regular actionscript ?
As i read there are some ways to speed up code a little , but are there some things that as3code cannot do ?
EDIT:
Please , do not focus on coding style , problems , type checks and syntax , let say ill have external SWF with 1 class written in bytecode and now i like to know what benefits i can get from lower level coding.
The Flash compiler doesn't use all the opcodes available in the Flash player so, in theory, you can indeed get some performance increase by "manually" writing opcodes.
For example, the Haxe compiler can make use of the Alchemy OpCodes and provide a boost of performance compared to Flash:
Access Alchemy OpCodes
There are OpCodes for memory allocation hidden in the SWF player which are used by Adobe Alchemy. Haxe has the ability to access them giving you low level memory access which can allow HUGE speed increases.
http://haxe.org/doc/why
I don't know how safe it is to use these opcodes though. Since they are not supported by the official Flash compiler, they might be dropped in a future release of the Flash Player.
Just adding to what Laurent posted. ASC is the ActionScript compiler used by Adobe products so far (MXML uses the code from ASC and the compiler in Flash CS has the customized build of it, but neither greatly affects the actual compiler, but act as the front end / "glue" to the linker and other utilities).
ASC is not an optimizing compiler. What it means is that it doesn't do any optimizations usually possible when compiling to lower level language. It analyzes the generated bytecode only to the level to make mostly certain it's not erroneous. (It is still possible that the compiler will generate an erroneous code from valid AS3 code).
There isn't one to one correspondence between all valid code you can write in bytecode and AS3. Which means that AS3 limits you to a subset of what is possible in bytecode. It is absolutely possible that in the resulting bytecode you would see how it is easier to get a certain value, when it is on stack, but there will be no tool in AS3 to get it from there. For example, you could've avoided creating a loop iterator relying on that the first register contains the iterator, if you know that the rest of the code inside the loop will never read/write to the first register. Obviously there are a lot more "features" you can discover when you analyze the way bytecode is processed.
But, it is important to understand that locally optimizing you will hardly achieve anything significant unless your goal is fine grained and very specific. Such as, for example, some particular cryptographic algorithm or string parsing routine etc. Optimizing by hand larger pieces of program is really difficult. In fact, so difficult that you will need a tool to at least verify yourself that you are actually optimizing. In the end, you will find yourself using this tool to generate the optimized variants of code and test them - and this is how compilers are built :)

best way of using cuda

There are ways of using cuda:
auto-paralleing tools such as PGI workstation;
wrapper such as Thrust(in STL style)
NVidia GPUSDK(runtime/driver API)
Which one is better for performance or learning curve or other factors?
Any suggestion?
Performance rankings will likely be 3, 2, 1.
Learning curve is (1+2), 3.
If you become a CUDA expert, then it will be next to impossible to beat the performance of your hand-rolled code using all the tricks in the book using the GPU SDK due to the control that it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and shown on several problems to have 90-95+% efficiency compared with hand-rolled CUDA. The reductions, scans, and many cool iterators they have are useful for a wide class of problems too.
Auto-parallelizing tools tend to not do quite as good a job with the different memory types as karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application, there are great articles about it on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK, and I've personally, as well as many others, have run into some problems and bugs in them, that are hard to fix (There also is no documentation for this "library" and I've heard that NVIDIA does not support it at all)
Instead, as you mentioned, use the one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher level API, and so you don't have to worry quite as much about all of the low level implementation details as you do in the Device API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
It depends on what you want to do on the GPU. If your algorithm would highly benefit from the things thrust can offer, like reduction, prefix, sum, then thrust is definitely worth a try and I bet you can't write the code faster yourself in pure CUDA C.
However if you're porting already parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I had already successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extend, but as far as I know you're launching new kernels for each thrust call, if you want to have all in one big fat kernel (taking too frequent kernel starts out of the equation), you have to use plain CUDA C with the SDK.
I find the pure CUDA C actually easier to learn, as it gives you quite a good understanding on what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I never used auto-paralleing tools such as PGI workstation, but I wouldn't advise to add even more "magic" into the equation.

Is there any self-improving compiler around?

I am not aware of any self-improving compiler, but then again I am not much of a compiler-guy.
Is there ANY self-improving compiler out there?
Please note that I am talking about a compiler that improves itself - not a compiler that improves the code it compiles.
Any pointers appreciated!
Side-note: in case you're wondering why I am asking have a look at this post. Even if I agree with most of the arguments I am not too sure about the following:
We have programs that can improve
their code without human input now —
they’re called compilers.
... hence my question.
While it is true that compilers can improve code without human interference, however, the claim that "compilers are self-improving" is rather dubious. These "improvements" that compilers make are merely based on a set of rules that are written by humans (cyborgs anyone?). So the answer to your question is : No.
On a side note, if there was anything like a self improving compiler, we'd know... first the thing would improve the language, then its own code and finally, it would modify its code to become a virus and make all developers use it... and then finally we'd have one of those classic computer-versus-humans-last-hope-for-humanity kind of things... so ... No.
MilepostGCC is a MachineLearning compiler, which improve itself with time in the sense that it is able to change itself in order to become "better" with time. A simpler iterative compilation approach is able to improve pretty much any compiler.
25 years of programming and I have never heard of such a thing (unless you're talking about compilers that auto download software updates!).
Not yet practically implemented, to my knowledge, but yes, the theory is there:
Goedel machines: self-referential universal problem solvers making provably optimal self- improvements.
A self improving compiler would, by definition, have to have self modifying code. If you look around, you can find examples of people doing this (self modifying code). However, it's very uncommon to see - especially on projects as large and complex as a compiler. And it's uncommon for the very good reason that it's ridiculously hard (ie, close to impossible) to guarantee correct functionality. A lot of coders who think they're smart (especially Assembly coders) play around with this at one point or another. The ones who actually are smart mostly move out of this phase. ;)
In some situations, a C compiler is run several times without any human input, getting a "better" compiler each time.
Fortunately (or unfortunately, from another point of view) this process plateaus after a few steps -- further iterations generate exactly the same compiler executable as the last.
We have all the GCC source code, but the only C compiler on this machine available is not GCC.
Alas, parts of GCC use "extensions" that can only be built with GCC.
Fortunately, this machine does have a functional "make" executable, and some random proprietary C compiler.
The human goes to the directory with the GCC source, and manually types "make".
The make utility finds the MAKEFILE, which directs it to run the (proprietary) C compiler to compile GCC, and use the "-D" option so that all the parts of GCC that use "extensions" are #ifdef'ed out. (Those bits of code may be necessary to compile some programs, but not the next stage of GCC.). This produces a very limited cut-down binary executable, that barely has enough functionality to compile GCC (and the people who write GCC code carefully avoid using functionality that this cut-down binary does not support).
The make utility runs that cut-down binary executable with the appropriate option so that all the parts of GCC are compiled in, resulting in a fully-functional (but relatively slow) binary executable.
The make utility runs the fully-functional binary executable on the GCC source code with all the optimization options turned on, resulting in the actual GCC executable that people will use from now on, and installs it in the appropriate location.
The make utility tests to make sure everything is working OK: it runs the GCC executable from the standard location on the GCC source code with all the optimization options turned on. It then compares the resulting binary executable with the GCC executable in the standard location, and confirms that they are identical (with the possible exception of irrelevant timestamps).
After the human types "make", the whole process runs automatically, each stage generating an improved compiler (until it plateaus and generates an identical compiler).
http://gcc.gnu.org/wiki/Top-Level_Bootstrap and http://gcc.gnu.org/install/build.html and Compile GCC with Code Sourcery have a few more details.
I've seen other compilers that have many more stages in this process -- but they all require some human input after each stage or two. Example: "Bootstrapping a simple compiler from nothing" by Edmund Grimley Evans 2001
http://homepage.ntlworld.com/edmund.grimley-evans/bcompiler.html
And there is all the historical work done by the programmers who have worked on GCC, who use previous versions of GCC to compile and test their speculative ideas on possibly improved versions of GCC. While I wouldn't say this is without any human input, the trend seems to be that compilers do more and more "work" per human keystroke.
I'm not sure if it qualifies, but the Java HotSpot compiler improves code at runtime using statistics.
But at compile time? How will that compiler know what's deficient and what's not? What's the measure of goodness?
There are plenty of examples of people using genetic techniques to pit algorithms against each other to come up with "better" solutions. But these are usually well-understood problems that have a metric.
So what metrics could we apply at compile time? Minimum size of compiled code, cyclometric complexity, or something else? Which of these is meaningful at runtime?
Well, there is JIT (just in time) techniques. One could argue that a compiler with some JIT optimizations might re-adjust itself to be more efficient with the program it is compiling ???