Is there any documentation on NVCC compiler optimization heuristics? - cuda

I'm looking for detailed documentation about the choices NVCC makes when optimizing code, but so far I haven't been able to find anything substantial in NVIDIA's documentation or in the literature.

No, there is no documentation, official or otherwise, describing the compiler internals.
You might be able to infer some basics, though. Start with the official documentation on the options for steering GPU code generation:
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-gpu-code-generation
In particular, if you decide to play around with those flags, you need to be aware of the hardware details of the GPU you are targeting, including its registers and memory.
The instruction set reference is also helpful and can be found at:
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref
Unofficially, a lot of analysis has been going on in academia, for example on characterizing instruction latencies:
https://www.groundai.com/project/instructions-latencies-characterization-for-nvidia-gpgpus/1
The cuobjdump utility can be useful for analysing the generated code, as discussed in the following thread (SASS is also described there):
https://forums.developer.nvidia.com/t/solved-sass-code-analysis/41167
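To see how these pieces fit together, here is a minimal sketch (the target architecture and register cap below are arbitrary example values, not tuning advice) that compiles a trivial kernel with explicit code-generation flags and then dumps the resulting SASS with cuobjdump:

// saxpy.cu -- a minimal kernel used only to inspect generated code.
// (sm_70 and the register cap are example values; substitute the
//  compute capability of your own GPU.)
//
// Compile with explicit code-generation options and ask ptxas to
// report resource usage (registers, shared memory, spills):
//   nvcc -arch=sm_70 -maxrregcount=32 -Xptxas -v -o saxpy saxpy.cu
//
// Then dump the final machine code (SASS) that was generated:
//   cuobjdump -sass saxpy

#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}

The -Xptxas -v output reports registers, shared memory and spills per kernel, which is often the quickest way to see how a flag change affected code generation.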

Related

Getting the instructions and addressing modes from a binary?

I am revisiting assembly, and there are many, many more instructions on x86 than in my previous work with SPARCs. In short, I'm going to write some inline-assembly functions to handle some speedups that the compiler is just doing poorly (there's no overflow bit in C). For this reason, I am interested in the types of instructions that are generally used, and the addressing modes, in a static context.
Is there a better tool than objdump for this? I hacked together objdump and awk, but I feel there's a better way.

cuGraph on Multi-GPU

Recently, I have been reading the code of cuGraph. I notice that the Louvain and Katz algorithms are said to support multi-GPU execution. However, when I read the C++ code of Louvain, I cannot find anything related to multi-GPU. Specifically, according to a prior post, multi-GPU support can be implemented by calling cudaSetDevice, but I cannot find this function in the Louvain code. Am I missing anything?
cuGraph supports multi-GPU by leveraging Dask. I encourage you to read the Dask cuGraph documentation that shows an example using PageRank.
For a Louvain example, I recommend looking at the docstring of the cugraph.dask.louvain function.
For completeness, under the hood cuGraph uses RAFT to manage the underlying NCCL and UCX communication.

How do I develop a CUDA application on my ATI card, to be later executed on NVIDIA hardware?

My computer has an ATI graphics card, but I need to code an algorithm I already have in CUDA to accelerate the process. Is that even possible? If yes, does anyone have a link or tutorial covering everything from setting up my IDE to coding simple image processing or passing an image? I also considered OpenCL, but I have not found any information on how to do anything with it.
This answer is directed more toward the part:
I also considered OpenCL but I have not found any information how to do anything with it.
Check on this NVIDIA site:
http://developer.nvidia.com/nvidia-gpu-computing-documentation
Scroll down and you will find:
OpenCL Programming Guide
This is a detailed programming guide for OpenCL developers.
OpenCL Best Practices Guide
This is a manual to help developers obtain the best performance from OpenCL.
OpenCL Overview for the CUDA Architecture
This whitepaper summarizes the guidelines for how to choose the best implementations for NVIDIA GPUs.
OpenCL Implementation Notes
This document describes the "Implementation Defined" behavior for the NVIDIA OpenCL implementation as required by the OpenCL specification, Version 1.0. The implementation-defined behavior is referenced below in the order of its reference in the OpenCL specification and is grouped by the section number of the specification.
On AMD/ATI you have this site for a brief introduction:
http://www.amd.com/us/products/technologies/stream-technology/opencl/pages/opencl-intro.aspx
And for more resources check:
http://www.amd.com/us/products/technologies/stream-technology/Pages/training-resources.aspx
Unless CUDA is a requirement, you should consider OpenCL again, as you can use it on both platforms, and you say you have one and want to develop for the other.
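If you do take another look at OpenCL, here is a minimal host-side sketch (plain OpenCL 1.x C API used from C++, no vendor extensions) that runs unchanged against either the AMD or the NVIDIA runtime; it simply enumerates the platforms and devices it finds:

// list_devices.cpp -- enumerate OpenCL platforms and devices.
// Build (assuming an OpenCL SDK is installed; paths differ between
// the AMD and NVIDIA toolkits):
//   g++ list_devices.cpp -lOpenCL -o list_devices

#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);

    cl_platform_id platforms[8];
    clGetPlatformIDs(num_platforms < 8 ? num_platforms : 8, platforms, NULL);

    for (cl_uint p = 0; p < num_platforms && p < 8; ++p) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", p, name);

        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);

        cl_device_id devices[8];
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                       num_devices < 8 ? num_devices : 8, devices, NULL);

        for (cl_uint d = 0; d < num_devices && d < 8; ++d) {
            char dev_name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dev_name), dev_name, NULL);
            printf("  Device %u: %s\n", d, dev_name);
        }
    }
    return 0;
}

The kernel source itself is also portable; only the build setup (headers and the OpenCL library you link against) differs between the two vendors' SDKs.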
You might also want to take a look at these:
http://blogs.nvidia.com/2011/06/cuda-now-available-for-multiple-x86-processors/
http://www.pgroup.com/resources/cuda-x86.htm
I haven't tried it myself, but the prospect of running CUDA code on x86 seems pretty attractive.

How to use my existing .cpp code with CUDA

I have code in C++ and want to use it along with CUDA. Can anyone please help me? Should I provide my code? I tried, but I need some starting code to proceed. I know how to write a simple squaring program (using CUDA and C++) in Windows (Visual Studio). Is that sufficient for my program?
The following are both good places to start. CUDA by Example is a good tutorial that gets you up and running pretty fast. Programming Massively Parallel Processors includes more background, e.g. chapters on the history of GPU architecture, and generally more depth.
CUDA by Example: An Introduction to General-Purpose GPU Programming
Programming Massively Parallel Processors: A Hands-on Approach
These both talk about CUDA 3.x so you'll want to look at the new features in CUDA 4.x at some point.
Thrust is definitely worth a look if your problem maps onto it well (see comment above). It's an STL-like library of containers, iterators and algorithms that implements data-parallel algorithms on top of CUDA.
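As a taste of what that looks like, here is a minimal, self-contained sketch (not tied to your program) that fills a vector on the host, then sorts and reduces it on the GPU with Thrust:

// thrust_sort.cu -- sort and reduce a vector on the GPU with Thrust.
// Compile with: nvcc thrust_sort.cu -o thrust_sort
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>
#include <cstdio>

int main() {
    // Generate random data on the host.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 1000;

    // Copy to the device, then sort and reduce there.
    thrust::device_vector<int> d = h;
    thrust::sort(d.begin(), d.end());
    int sum = thrust::reduce(d.begin(), d.end(), 0);

    printf("sum of sorted elements = %d\n", sum);
    return 0;
}

Notice there are no explicit kernels, launches or cudaMemcpy calls; the containers and algorithms handle the transfers and parallel execution.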
Here are two tutorials on getting started with CUDA and Visual C++ 2010:
http://www.ademiller.com/blogs/tech/2011/03/using-cuda-and-thrust-with-visual-studio-2010/
http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/
There's also a post on the NVIDIA forum:
http://forums.nvidia.com/index.php?showtopic=184539
Asking very general "how do I get started on ..." questions on Stack Overflow generally isn't the best approach. Typically the best reply you'll get is "go read a book or the manual". It's much better to ask specific questions here. Please don't create duplicate questions; it isn't helpful.
It's a non-trivial task to convert a program from straight C(++) to CUDA. As far as I know, it is possible to use C++-like features within CUDA (especially with the announced CUDA 4.0), but I think it's easier to start with plain C constructs (i.e. structs, pointers, elementary data types).
Start by reading the CUDA programming guide and by examining the examples that come with the CUDA SDK or are available here. I personally found the vector addition sample quite enlightening. It can be found over here.
I cannot tell you how to write your globals and shareds for your specific program, but after reading the introductory material you will have at least a vague idea of how to do so.
The problem is that it is (as far as I know) not possible to give a generic recipe for transforming pure C(++) into code suitable for CUDA. But here are some cornerstones for you:
Central idea of CUDA: loops can be transformed into separate threads executed in parallel on the GPU, one thread per iteration (see the sketch after this list).
Therefore, the individual iterations should ideally be independent of one another.
For optimal execution, the execution paths of the threads should be (almost) the same, i.e. the individual threads should do almost the same work.
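Here is the sketch referred to above: a hypothetical array-scaling loop and the equivalent CUDA kernel, where the loop body becomes the kernel body and the loop index becomes the thread index.

// scale.cu -- turning a loop into a CUDA kernel.
#include <cstdio>

// Plain C++ version: a loop over n elements.
void scale_cpu(float *data, float factor, int n) {
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}

// CUDA version: each thread handles one former loop iteration.
__global__ void scale_gpu(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's "loop index"
    if (i < n)                                      // guard against surplus threads
        data[i] *= factor;
}

int main() {
    const int n = 1000;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n former iterations.
    scale_gpu<<<(n + 255) / 256, 256>>>(dev, 3.0f, n);

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("host[0] = %f\n", host[0]);  // expect 3.000000
    cudaFree(dev);
    return 0;
}

Because every element is scaled independently, the iterations satisfy the independence requirement above, and every thread follows the same execution path.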
You can have multiple .cpp and .cu files in your project. Unless you want your .cu files to contain only device code, it should be fairly easy.
For your .cu files, write a header file containing the declarations of their host functions. Then include that header file in other .cu or .cpp files. The linker will do the rest. It is no different from having multiple plain C++ .cpp files in your project.
I assume you already have CUDA rule files for your Visual Studio.
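A minimal sketch of that layout (file and function names are just placeholders):

// kernels.h -- plain C++ header declaring the host-callable wrapper.
#ifndef KERNELS_H
#define KERNELS_H
void double_on_gpu(float *host_data, int n);
#endif

// kernels.cu -- device code plus the wrapper, compiled by nvcc.
#include "kernels.h"

__global__ void double_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void double_on_gpu(float *host_data, int n) {
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
    double_kernel<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host_data, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

// main.cpp -- ordinary C++ file, compiled by the host compiler.
#include "kernels.h"

int main() {
    float data[256];
    for (int i = 0; i < 256; ++i) data[i] = 1.0f;
    double_on_gpu(data, 256);   // the linker resolves this from kernels.cu
    return 0;
}

The header contains no CUDA-specific syntax, so any .cpp file can include it; only kernels.cu needs to go through nvcc.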

Is there any self-improving compiler around?

I am not aware of any self-improving compiler, but then again I am not much of a compiler-guy.
Is there ANY self-improving compiler out there?
Please note that I am talking about a compiler that improves itself - not a compiler that improves the code it compiles.
Any pointers appreciated!
Side-note: in case you're wondering why I am asking, have a look at this post. Even if I agree with most of the arguments, I am not too sure about the following:
We have programs that can improve
their code without human input now —
they’re called compilers.
... hence my question.
While it is true that compilers can improve code without human interference, the claim that "compilers are self-improving" is rather dubious. These "improvements" that compilers make are merely based on a set of rules written by humans (cyborgs, anyone?). So the answer to your question is: no.
On a side note, if there were anything like a self-improving compiler, we'd know... first the thing would improve the language, then its own code and finally, it would modify its code to become a virus and make all developers use it... and then finally we'd have one of those classic computer-versus-humans-last-hope-for-humanity kind of things... so ... No.
MilepostGCC is a machine-learning compiler which improves itself over time, in the sense that it is able to change itself in order to become "better". A simpler iterative-compilation approach can improve pretty much any compiler.
25 years of programming and I have never heard of such a thing (unless you're talking about compilers that auto download software updates!).
Not yet practically implemented, to my knowledge, but yes, the theory is there:
Goedel machines: self-referential universal problem solvers making provably optimal self-improvements.
A self-improving compiler would, by definition, have to have self-modifying code. If you look around, you can find examples of people doing this (self-modifying code). However, it's very uncommon to see, especially in projects as large and complex as a compiler. And it's uncommon for the very good reason that it's ridiculously hard (i.e., close to impossible) to guarantee correct functionality. A lot of coders who think they're smart (especially Assembly coders) play around with this at one point or another. The ones who actually are smart mostly move out of this phase. ;)
In some situations, a C compiler is run several times without any human input, getting a "better" compiler each time.
Fortunately (or unfortunately, from another point of view) this process plateaus after a few steps -- further iterations generate exactly the same compiler executable as the last.
For example, consider bootstrapping GCC: we have all the GCC source code, but the only C compiler available on this machine is not GCC.
Alas, parts of GCC use "extensions" that can only be built with GCC.
Fortunately, this machine does have a functional "make" executable, and some random proprietary C compiler.
The human goes to the directory with the GCC source, and manually types "make".
The make utility finds the MAKEFILE, which directs it to run the (proprietary) C compiler to compile GCC, and use the "-D" option so that all the parts of GCC that use "extensions" are #ifdef'ed out. (Those bits of code may be necessary to compile some programs, but not the next stage of GCC.). This produces a very limited cut-down binary executable, that barely has enough functionality to compile GCC (and the people who write GCC code carefully avoid using functionality that this cut-down binary does not support).
The make utility runs that cut-down binary executable with the appropriate option so that all the parts of GCC are compiled in, resulting in a fully-functional (but relatively slow) binary executable.
The make utility runs the fully-functional binary executable on the GCC source code with all the optimization options turned on, resulting in the actual GCC executable that people will use from now on, and installs it in the appropriate location.
The make utility tests to make sure everything is working OK: it runs the GCC executable from the standard location on the GCC source code with all the optimization options turned on. It then compares the resulting binary executable with the GCC executable in the standard location, and confirms that they are identical (with the possible exception of irrelevant timestamps).
After the human types "make", the whole process runs automatically, each stage generating an improved compiler (until it plateaus and generates an identical compiler).
http://gcc.gnu.org/wiki/Top-Level_Bootstrap and http://gcc.gnu.org/install/build.html and Compile GCC with Code Sourcery have a few more details.
I've seen other compilers that have many more stages in this process -- but they all require some human input after each stage or two. Example: "Bootstrapping a simple compiler from nothing" by Edmund Grimley Evans 2001
http://homepage.ntlworld.com/edmund.grimley-evans/bcompiler.html
And there is all the historical work done by the programmers who have worked on GCC, who use previous versions of GCC to compile and test their speculative ideas on possibly improved versions of GCC. While I wouldn't say this is without any human input, the trend seems to be that compilers do more and more "work" per human keystroke.
I'm not sure if it qualifies, but the Java HotSpot compiler improves code at runtime using statistics.
But at compile time? How will that compiler know what's deficient and what's not? What's the measure of goodness?
There are plenty of examples of people using genetic techniques to pit algorithms against each other to come up with "better" solutions. But these are usually well-understood problems that have a metric.
So what metrics could we apply at compile time? Minimum size of compiled code, cyclomatic complexity, or something else? Which of these is meaningful at runtime?
Well, there are JIT (just-in-time) techniques. One could argue that a compiler with some JIT optimizations might re-adjust itself to be more efficient with the program it is compiling?