CUDA intrinsic function advantage

Similar to this question: is there any advantage to using the intrinsics (single, double or half) in the CUDA Math API? I understand some operations have faster (less accurate) versions, such as __fdividef, and that these can always be used via -use_fast_math, but what about the other functions? For example, why would one use __fadd_rd(A,B) instead of A+B, or __fmaf_rd(A,B,C) instead of A*B+C? One reason I can think of is that one can choose the rounding mode more conveniently - fine.
Also, some functions, for example __fmul_rd, "will never be merged into a single multiply-add instruction" (according to the CUDA Math API documentation). Why would this be advantageous?

The really short answer is that using something like __fmul_rd is never "advantageous" in itself, but sometimes using floating point instructions with clearly defined and fully predictable (or standardised) rounding and compilation behaviour is required to make calculations work correctly. This, for example.
The general rule is that if you don't understand why those floating point intrinsic functions exist, you shouldn't use them.
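As one illustration of the kind of calculation that needs controlled rounding, here is a minimal, hypothetical sketch of interval arithmetic in device code (the Interval type and interval_add name are mine; __fadd_rd and __fadd_ru are the real directed-rounding intrinsics). Rounding the lower bound down and the upper bound up guarantees the exact result stays inside the interval, which a plain a + b with the default round-to-nearest cannot promise.

    // Hypothetical sketch: directed rounding for interval arithmetic.
    // __fadd_rd rounds toward -infinity, __fadd_ru rounds toward +infinity,
    // so [lo, hi] is guaranteed to contain the exact sum.
    struct Interval { float lo, hi; };

    __device__ Interval interval_add(Interval a, Interval b)
    {
        Interval r;
        r.lo = __fadd_rd(a.lo, b.lo);  // lower bound rounded down
        r.hi = __fadd_ru(a.hi, b.hi);  // upper bound rounded up
        return r;
    }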

Intrinsics give you finer control over exactly what operations your inner loop is going to do. If I call __fmaf_rd, I am virtually certain that the emitted PTX is going to contain a fused multiply-add instruction with the round-down rounding mode, without my having to resort to writing inline assembly code.
I will therefore have no worries that the compiler might optimize the loop differently than I want*, or that there might be some subtlety of the standards I'm overlooking that requires the compiler to implement something more complicated than I thought I wrote.
Naturally, this is only a good motivation if I really know what I'm doing in this regard, but if I do, it's there for me to use. And being an intrinsic is superior to inline assembly, because the compiler actually understands the instruction.
*: You can't imagine how frustrating it is when you know the best way to implement a loop, but the compiler keeps "optimizing" it into something less efficient.
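For concreteness, here is a minimal sketch (the kernel name and its parameters are mine) contrasting the plain expression with the intrinsic: with __fmaf_rd you know exactly which instruction and rounding mode you get, whereas the plain expression leaves contraction into an FMA, and the rounding that goes with it, up to the compiler.

    // Hypothetical example kernel contrasting plain arithmetic with the intrinsic.
    __global__ void fma_demo(const float *a, const float *b, const float *c,
                             float *plain, float *forced, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // The compiler may or may not contract this into an FMA,
            // and if it does, it uses round-to-nearest-even.
            plain[i] = a[i] * b[i] + c[i];

            // This is guaranteed to be a single fused multiply-add,
            // rounded toward minus infinity, and never split or re-fused.
            forced[i] = __fmaf_rd(a[i], b[i], c[i]);
        }
    }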


Regarding (When Executed) in Haskell IO Monad

I have no problem with the IO Monad. But I want to understand the following:
In all (or almost all) Haskell tutorials and textbooks, it is said that getChar is not a pure function, because it can give you a different result each time. My question is: who said that it is a function in the first place? Unless you give me the implementation of this function, and I study that implementation, I can't tell whether it is pure. So, where is that implementation?
In all (or almost all) Haskell tutorials and textbooks, it is said that, say, IO String is an action that, when executed, can give you back a value of type String. This is fine, but who performs this execution, and where does it take place? Of course, the computer is doing the execution, and that is fine too. But since I am only a beginner, forgive me for asking: where is the recipe for this "execution"? I would guess it is not written in Haskell. Does this latter idea mean that, after all, a Haskell program is converted into a C-like program, which is eventually converted into assembly and then machine code? If so, where can one find the implementation of the IO stuff in Haskell?
Many thanks
Haskell functions are not the same as computations.
A computation is a piece of imperative code (perhaps written in C or assembler and then compiled to machine code that is directly executable on a processor), which is by nature effectful and even unrestricted in its effects. That is, once it is run, a computation may access and alter any memory and perform any operations, such as interacting with the keyboard and screen, or even launching missiles.
By contrast, a function in a pure language, such as Haskell, is unable to alter arbitrary memory and launch missiles. It can only alter its own personal section of memory and return a result that is specified in its type.
So, in a sense, Haskell is a language that cannot do anything. Haskell is useless. This was a major problem during the 1990s, until IO was integrated into Haskell.
Now, an IO a value is a link to a separately prepared computation that will, eventually, hopefully, produce a value of type a. You will not be able to create an IO a out of pure Haskell functions alone. All the IO primitives are implemented separately and packaged into GHC. You can then compose these simple computations into less trivial ones, and eventually your program may have any effects you wish.
One point, though: pure functions are isolated from each other; they can only influence each other when you use them together. Computations, on the other hand, may interact with each other freely (as I said, they can generally do anything), and therefore can (and do) accidentally break each other. That's why there are so many bugs in software written in imperative languages! So, in Haskell, computations are kept inside IO.
I hope this dispels at least some of your confusion.

Proving equivalence of programs

The ultimate in optimizing compilers would be one that searched among the space of programs for a program equivalent to the original but faster. This has been done in practice for very small basic blocks: https://en.wikipedia.org/wiki/Superoptimization
It sounds like the hard part is the exponential nature of the search space, but actually it's not; the hard part is, supposing you find what you're looking for, how do you prove that the new, faster program is really equivalent to the original?
Last time I looked into it, some progress had been made on proving certain properties of programs in certain contexts, particularly at a very small scale when you are talking about scalar variables or small fixed bit vectors, but not really on proving equivalence of programs at a larger scale when you are talking about complex data structures.
Has anyone figured out a way to do this yet, even 'modulo solving this NP-hard search problem that we don't know how to solve yet'?
Edit: Yes, we all know about the halting problem. It's defined in terms of the general case. Humans are an existence proof that this can be done for many practical cases of interest.
You're asking a fairly broad question, but let me see if I can get you going.
John Regehr does a really nice job surveying some relevant papers on superoptimizers: https://blog.regehr.org/archives/923
The thing is, you don't really need to prove whole-program equivalence for these types of optimizations. Instead you just need to prove that, given the CPU is in a particular state, two sequences of code modify the CPU state in the same way. To prove this across many optimizations (i.e. at scale), you would typically first throw some random inputs at both sequences. If they're not equivalent bits of code, you might get lucky and very quickly show this by finding a counterexample, and you can move on. If you haven't found a counterexample, you can then try to prove equivalence via a computationally expensive SAT solver. (As an aside, the STOKE paper that Regehr mentions is particularly interesting if you're interested in superoptimizers.)
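As a toy, hypothetical illustration of that workflow (all names below are mine, and the "code sequences" are just small host-side functions): random testing cheaply rejects a rewrite that is wrong for some inputs, and only candidates that survive would be handed to an expensive SAT/SMT-based proof.

    #include <cstdint>
    #include <cstdio>
    #include <random>

    // Original computation and two candidate "optimized" rewrites.
    uint32_t original(uint32_t x)   { return x * 2u; }
    uint32_t candidate1(uint32_t x) { return x << 1; }         // equivalent
    uint32_t candidate2(uint32_t x) { return (x * 8u) / 4u; }  // wrong once x*8 overflows

    // Cheap filter: hammer both sequences with random inputs.
    // A mismatch is a counterexample and the candidate is discarded immediately;
    // only survivors go on to the expensive formal equivalence proof.
    template <typename F, typename G>
    bool survives_random_testing(F f, G g, int trials = 1000000)
    {
        std::mt19937 rng(42);
        std::uniform_int_distribution<uint32_t> dist;
        for (int i = 0; i < trials; ++i) {
            uint32_t x = dist(rng);
            if (f(x) != g(x)) {
                std::printf("counterexample: x = %u\n", x);
                return false;
            }
        }
        return true;  // no counterexample found; now try to prove equivalence properly
    }

    int main()
    {
        std::printf("candidate1 survives: %d\n", survives_random_testing(original, candidate1));
        std::printf("candidate2 survives: %d\n", survives_random_testing(original, candidate2));
    }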
Now looking at whole program semantic equivalence, one approach here is the one used by the CompCert compiler. Essentially that compiler is proving this theorem:
If CompCert is able to translate C code X into assembly code Y then X and Y are semantically equivalent.
In addition, CompCert does apply a few compiler optimizations, and indeed these optimizations are often the areas that traditional compilers get wrong. Perhaps something like CompCert is what you're after; in that case, the compiler goes about things via a series of refinement passes, where it proves that if each pass succeeds, the result is semantically equivalent to the output of the previous pass.

Normal Cuda Vs CuBLAS?

Just out of curiosity: cuBLAS is a library for basic matrix computations. But these computations can, in general, also be written easily in normal CUDA code, without using cuBLAS. So what is the major difference between the cuBLAS library and your own CUDA program for matrix computations?
We highly recommend developers use cuBLAS (or cuFFT, cuRAND, cuSPARSE, thrust, NPP) when suitable, for many reasons:
We validate correctness across every supported hardware platform, including those which we know are coming up but which maybe haven't been released yet. For complex routines, it is entirely possible to have bugs which show up on one architecture (or even one chip) but not on others. This can even happen with changes to the compiler, the runtime, etc.
We test our libraries for performance regressions across the same wide range of platforms.
We can fix bugs in our code if you find them. Hard for us to do this with your code :)
We are always looking for which reusable and useful bits of functionality can be pulled into a library - this saves you a ton of development time, and makes your code easier to read by coding to a higher level API.
Honestly, at this point, I can probably count on one hand the number of developers out there who actually implement their own dense linear algebra routines rather than calling cuBLAS. It's a good exercise when you're learning CUDA, but for production code it's usually best to use a library.
(Disclosure: I run the CUDA Library team)
There are several reasons you'd choose to use a library instead of writing your own implementation. Three, off the top of my head:
You don't have to write it. Why do work when somebody else has done it for you?
It will be optimised. NVIDIA-supported libraries such as cuBLAS are likely to be optimised for all current GPU generations, and later releases will be optimised for later generations. While most BLAS operations may seem fairly simple to implement, getting peak performance means optimising for the hardware (this is not unique to GPUs). A simple implementation of SGEMM, for example, may be many times slower than an optimised version (see the sketch after this list).
They tend to work. There's probably less chance you'll run up against a bug in a library than that you'll create a bug in your own implementation which bites you when you change some parameter or other in the future.
The above isn't just relevant to cuBLAS: if you have a method that's in a well-supported library, you'll probably save a lot of time and gain a lot of performance by using it rather than your own implementation.
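To make the SGEMM point concrete, here is a hedged sketch (function names are mine; cublasSgemm is the real cuBLAS entry point): a textbook one-thread-per-element kernel next to the equivalent library call. Both compute C = alpha*A*B + beta*C for column-major matrices, but the naive kernel does nothing about shared-memory tiling, register blocking, or tensor cores, which is where the large performance gap comes from.

    #include <cublas_v2.h>

    // Naive SGEMM: one thread per output element, column-major, C = alpha*A*B + beta*C.
    // Correct, but with no tiling or blocking it is typically many times slower than cuBLAS.
    __global__ void sgemm_naive(int m, int n, int k, float alpha,
                                const float *A, const float *B, float beta, float *C)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < m && col < n) {
            float acc = 0.0f;
            for (int i = 0; i < k; ++i)
                acc += A[i * m + row] * B[col * k + i];
            C[col * m + row] = alpha * acc + beta * C[col * m + row];
        }
    }

    // The library equivalent: one call, tuned per GPU generation by NVIDIA.
    void sgemm_cublas(cublasHandle_t handle, int m, int n, int k, float alpha,
                      const float *dA, const float *dB, float beta, float *dC)
    {
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
    }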

Slatec + CUDA Fortran

I have code written in old-style Fortran 95 for combustion modelling. One of the features of this problem is that one has to solve a stiff ODE system to take into account the influence of chemical reactions. For this purpose I use the Fortran SLATEC library, which is also quite old. The solving procedure is straightforward: one just needs to call the subroutine ddriv3 in every cell of the computational domain, so it looks something like this:
do i = 1,Number_of_cells   ! Number of cells is about 2000
  call ddriv3(...)         ! All calls are independent of the cell number i
end do
ddriv3 is quite complex and utilizes many other library functions.
Is there any way to get an advantage from CUDA Fortran without hunting for another library for this purpose? If I just run this as a "parallel loop", will that be efficient, or is there perhaps a better way?
I'm sorry for asking the kind of question that immediately invites the obvious answer "Why don't you just try it and see for yourself?", but I'm under really tight time constraints. I have no experience with CUDA, and I just want to choose the most appropriate and easiest way to start.
Thanks in advance!
You won't be able to use or parallelize the ddriv3 call without some effort. Your usage of the phrase "parallel loop" suggests to me you may be thinking of using OpenACC directives with Fortran, as opposed to CUDA Fortran, but the general answer isn't any different in either case.
The ddriv3 call, being part of a Fortran library (which is presumably compiled for x86 use), cannot be directly used either in CUDA Fortran (i.e. using CUDA GPU kernels within Fortran) or in OpenACC Fortran, for essentially the same reason: the library code is x86 code and cannot be run on the GPU.
Since presumably you may have access to the source implementation of ddriv3, you might be able to extract the source code, and work on creating a CUDA version of it (or a version that OpenACC won't choke on), but if it uses many other library routines, it may mean that you have to create CUDA (or direct Fortran source, for OpenACC) versions of each of those library calls as well. If you have no experience with CUDA, this might not be what you want to do (I don't know.) If you go down this path, it would certainly imply learning more about CUDA, or at least converting the library calls to direct Fortran source (for an OpenACC version).
For the above reasons, it might make sense to investigate whether a GPU library replacement (or something similar) might exist for the ddriv3 call (but you specifically excluded that option in your question.) There are certainly GPU libraries that can assist in solving ODE's.
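To show what the porting approach would look like structurally, here is a purely hypothetical sketch: ddriv3_device below stands for a hand-ported __device__ version of ddriv3 (and of every SLATEC routine it calls, which is exactly the effort described above). The independent per-cell calls then map naturally onto one GPU thread per cell.

    // Hypothetical device port of the solver; writing this body (and every
    // SLATEC routine it calls) as device code is the real work discussed above.
    __device__ void ddriv3_device(double *cell_state)
    {
        // ... ported solver body omitted ...
    }

    // One thread per computational cell; all cells are independent,
    // so the loop over cells becomes the thread index.
    __global__ void solve_all_cells(double *states, int n_cells, int state_size)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_cells)
            ddriv3_device(&states[i * state_size]);
    }

With only around 2000 independent cells and a large, branch-heavy solver running in each thread, the speedup may well be limited by occupancy and divergence, which is another argument for checking first whether a GPU-oriented ODE library already covers your case.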

Why is memoization not a language feature?

I was wondering: why is memoization not provided natively as a language feature in any language I know about?
Edit: to clarify, what I mean is that the language would provide a keyword to mark a given function as memoizable, not that every function is automatically memoized "by default" unless specified otherwise. For example, Fortran provides the keyword PURE to mark a particular function as pure. I guess the compiler can take advantage of this information to memoize the call, but I don't know what happens if you declare a function with side effects as PURE.
What YOU want from memoization may not be the same as what the compiler memoization option would provide.
You may know that it is only profitable to memoize the last 10 or so distinct values computed, because you know how the function will be used.
You may know that it only makes sense to memoize the last 2 or 3 values, because you will never use values older than that. (Fibonacci's Sequence comes to mind.)
You may be generating a LOT of values on some runs, and just a few on others.
You may want to "throw away" some of the memoized values and start over. (I memoized a random number generator this way, so I could replay the sequence of random numbers that built a certain structure, while some other parameters of the structure had been changed.)
Memoization as an optimization depends on the search for the memoized value being a lot cheaper than recomputation of the value. This in turn depends on the ordering of the input requests. This has implications for the memoization database: Does it use a stack, an array of all possible input values (which may be very large), a bucket hash, or a b-tree?
The memoizing compiler has to either provide a "one size fits all" memoization, or it has to provide lots of possible alternatives, and parameters to control the alternatives. At some point, it becomes easier for everyone to require the user to provide his own memoization.
Because compilers have to emit semantically correct programs. You can't memoize a function without changing program semantics unless it is referentially transparent. In most programming languages not all functions are referentially transparent (pure functional programming languages are an exception) so you can't memoize everything. But then a mechanism is needed for detecting referential transparency and that is too hard.
In Haskell, memoization is automatic for (pure) functions you've defined that take no arguments. And the Fibonacci example on that wiki page is about the simplest demonstrable example I could think of, too.
Haskell can do this because your pure functions are defined to produce the same results every time; of course, monadic functions that depend on side effects won't be memoized.
I'm not sure what the upper limits are -- obviously, it won't memoize more than the available memory. And I'm also not sure offhand if the memoization occurs at compile-time (if the values can be determined at compile-time), or if it always occurs the first time the function is called.
Clojure has a memoize function (http://richhickey.github.com/clojure/clojure.core-api.html#clojure.core/memoize):
memoize
Usage: (memoize f)
Returns a memoized version of a referentially transparent function. The memoized version of the function keeps a cache of the mapping from arguments to results and, when calls with the same arguments are repeated often, has higher performance at the expense of higher memory use.
A) Memoization trades space for time. I imagine that this can turn out to be a fairly unbounded cost, in the sense that the amount of data programs or libraries would have to store could consume large parts of memory really quickly.
For a couple of languages, memoization is easy to implement and easy to customize for the given requirements.
As an example, take some natural language processing on large bodies of text, where you don't want to compute basic properties of the texts (word count, frequency, co-occurrences, ...) over and over again. In that case, memoization combined with object serialization can be more useful than in-memory caching, since you may run your application multiple times on unchanged corpora.
B) Another aspect: it's not true that all functions or methods yield the same output for the same input. Some keyword or syntax for memoization would therefore be necessary anyway, along with configuration (memory limits, invalidation policy, ...).
Because you shouldn't implement something as a language feature when it can easily be implemented in the language itself. A memoization feature belongs in a library, which is exactly where most languages put it.
Your question also leaves open the solution of your learning more languages. I think that Lisp supports memoization, and I know that Mathematica does.
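To illustrate why memoization sits comfortably in a library, here is a minimal C++ sketch of a memoizing wrapper (all names here are mine); a real library version would add an eviction policy, size limits, and thread safety, which is exactly the kind of configurability discussed above.

    #include <cstdio>
    #include <functional>
    #include <map>
    #include <memory>

    // Wrap any single-argument function in a cache. Only sensible for
    // referentially transparent functions; the library cannot check that for you.
    template <typename Arg, typename Result>
    std::function<Result(Arg)> memoize(std::function<Result(Arg)> f)
    {
        auto cache = std::make_shared<std::map<Arg, Result>>();
        return [f, cache](Arg a) {
            auto it = cache->find(a);
            if (it != cache->end())
                return it->second;      // cache hit
            Result r = f(a);
            cache->emplace(a, r);       // cache miss: compute and remember
            return r;
        };
    }

    int main()
    {
        std::function<long(int)> slow_square = [](int x) {
            std::puts("computing...");
            return static_cast<long>(x) * x;
        };
        auto fast_square = memoize(slow_square);
        std::printf("%ld\n", fast_square(12));  // prints "computing..." then 144
        std::printf("%ld\n", fast_square(12));  // cached: prints just 144
    }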
In order for memoization to work as a language feature there would be a couple requirements.
The compiler would need to be able to identify valid functions for memoization (e.g. functions that are referentially transparent).
The run-time would have to be able to intelligently select candidates for memoization without slowing down the overall performance.
There are some assumptions in the other language, but if we can have performance gains by just-in-time compilation of hot spots in a Java VM, then one can surely write an automated memoization system.
While non-trivial, I think it is theoretically possible to get performance gains this way in a language (especially an interpreted one), and it is a worthwhile area for research.
Not all languages natively support function decorators. I guess supporting those would be a more general approach than supporting just memoization.
Reverse the question: why should it be? As someone has said, it can be put in a library, so there is no need to add syntax to the language; it's only usable on pure functions, which are hard to identify automatically (unless you force the programmer to annotate them); and it's very hard to determine whether memoization is going to speed things up or not. I don't think it's a desirable feature for a programming language.
I really think such an option should exist.
In data-processing tasks there is often immutable input data (time series, for example, where for a given time, as soon as a value is known, it can never change). Bearing in mind how affordable RAM is today, if a function's result only depends on such immutable data, it is rational to memoize it rather than reread it every time it's needed. Currently I have to (in Scala and C#) manually introduce an in-memory storage table and write three functions instead of one: one reading a value from a file/database/web service, one storing it into the in-memory table, and one wrapping them that reads from memory if available or calls the raw function if not. I think this could and should be implemented as a keyword and done behind the scenes.