I am revisiting assembly, and there are many, many more instructions on x86 than on the SPARCs I worked with previously. In short, I'm going to write some inline-assembly functions to handle some speedups in places where the compiler is doing poorly (there's no overflow bit in C). For this reason, I am interested in which types of instructions and addressing modes are generally used, in a static context.
Is there a better tool than objdump for this? I hacked together objdump and awk, but I feel there's a better way.
Just out of curiosity: cuBLAS is a library for basic matrix computations. But these computations, in general, can also be written easily in normal CUDA code, without using cuBLAS. So what is the major difference between the cuBLAS library and your own CUDA program for matrix computations?
We highly recommend developers use cuBLAS (or cuFFT, cuRAND, cuSPARSE, thrust, NPP) when suitable for many reasons:
We validate correctness across every supported hardware platform, including those which we know are coming up but which maybe haven't been released yet. For complex routines, it is entirely possible to have bugs which show up on one architecture (or even one chip) but not on others. This can even happen with changes to the compiler, the runtime, etc.
We test our libraries for performance regressions across the same wide range of platforms.
We can fix bugs in our code if you find them. Hard for us to do this with your code :)
We are always looking for which reusable and useful bits of functionality can be pulled into a library - this saves you a ton of development time, and makes your code easier to read by coding to a higher level API.
Honestly, at this point, I can probably count on one hand the number of developers out there who actually implement their own dense linear algebra routines rather than calling cuBLAS. It's a good exercise when you're learning CUDA, but for production code it's usually best to use a library.
(Disclosure: I run the CUDA Library team)
There are several reasons you'd choose to use a library instead of writing your own implementation. Three, off the top of my head:
You don't have to write it. Why do work when somebody else has done it for you?
It will be optimised. NVIDIA supported libraries such as cuBLAS are likely to be optimised for all current GPU generations, and later releases will be optimised for later generations. While most BLAS operations may seem fairly simple to implement, to get peak performance you have to optimise for hardware (this is not unique to GPUs). A simple implementation of SGEMM, for example, may be many times slower than an optimised version.
They tend to work. There's probably less chance you'll run up against a bug in a library than that you'll create a bug in your own implementation which bites you when you change some parameter or other in the future.
The above isn't just relevant to cuBLAS: if you have a method that's in a well supported library you'll probably save a lot of time and gain a lot of performance using it relative to using your own implementation.
I have code in C++ and want to use it along with CUDA. Can anyone please help me? Should I provide my code? I have tried, but I need some starting code to proceed. I know how to write a simple square program (using CUDA and C++) for Windows (Visual Studio). Is that sufficient for my program?
The following are both good places to start. CUDA by Example is a good tutorial that gets you up and running pretty fast. Programming Massively Parallel Processors includes more background, e.g. chapters on the history of GPU architecture, and generally more depth.
CUDA by Example: An Introduction to General-Purpose GPU Programming
Programming Massively Parallel Processors: A Hands-on Approach
These both talk about CUDA 3.x so you'll want to look at the new features in CUDA 4.x at some point.
Thrust is definitely worth a look if your problem maps onto it well (see comment above). It's an STL-like library of containers, iterators and algorithms that implements data-parallel algorithms on top of CUDA.
Here are two tutorials on getting started with CUDA and Visual C++ 2010:
http://www.ademiller.com/blogs/tech/2011/03/using-cuda-and-thrust-with-visual-studio-2010/
http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/
There's also a post on the NVIDIA forum:
http://forums.nvidia.com/index.php?showtopic=184539
Asking very general "how do I get started on ..." questions on Stack Overflow generally isn't the best approach. Typically the best reply you'll get is "go read a book or the manual". It's much better to ask specific questions here. Please don't create duplicate questions; it isn't helpful.
It's a non-trivial task to convert a program from straight C(++) to CUDA. As far as I know, it is possible to use C++ like stuff within CUDA (esp. with the announced CUDA 4.0), but I think it's easier to start with only C stuff (i.e. structs, pointers, elementary data types).
Start by reading the CUDA programming guide and by examining the examples coming with the CUDA SDK or available here. I personally found the vector addition sample quite enlightening. It can be found over here.
I cannot tell you how to write your globals and shareds for your specific program, but after reading the introductory material, you will have at least a vague idea of how to do it.
The problem is that it is (as far as I know) not possible to give a generic recipe for transforming pure C(++) into code suitable for CUDA. But here are some cornerstones for you:
Central idea for CUDA: Loops can be transformed into different threads executed multiple times in parallel on the GPU.
Ideally, therefore, the individual iterations are independent of one another.
For optimal execution, the execution paths of the threads should be (almost) the same, i.e. the individual threads should do almost the same thing.
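The cornerstones above can be sketched in plain C (not CUDA, so it runs without a GPU). The function names `add_loop`, `add_one` and `launch_emulated` are made up for this illustration; in real CUDA the per-index function would be a kernel and the index would come from the thread and block IDs, which is not shown here.

```c
#include <stddef.h>

/* Sequential version: one loop, iterations run one after another. */
void add_loop(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* "Kernel" version: the loop body as a function of the index alone.
 * On a GPU each thread would run this body once for its own index. */
void add_one(const float *a, const float *b, float *out, size_t n, size_t i)
{
    if (i < n)   /* guard: there may be more threads than elements */
        out[i] = a[i] + b[i];
}

/* Stand-in for a kernel launch (hypothetical helper, CPU only): calls the
 * body once per "thread". No iteration reads another iteration's result,
 * so the order does not matter; it runs backwards here to make that point. */
void launch_emulated(const float *a, const float *b, float *out,
                     size_t n, size_t num_threads)
{
    for (size_t t = num_threads; t-- > 0; )
        add_one(a, b, out, n, t);
}
```

Both versions produce the same output array. If an iteration depended on the result of the previous one (a running sum, for example), the per-index version would no longer be correct as written, which is why independent iterations matter.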
You can have multiple .cpp and .cu files in your project. Unless you want your .cu files to contain only device code, it should be fairly easy.
For your .cu files you specify a header file, containing host functions in it. Then, include that header file in other .cu or .cpp files. The linker will do the rest. It is nothing different than having multiple plain C++ .cpp files in your project.
I assume you already have CUDA rule files for your Visual Studio.
I am not aware of any self-improving compiler, but then again I am not much of a compiler-guy.
Is there ANY self-improving compiler out there?
Please note that I am talking about a compiler that improves itself - not a compiler that improves the code it compiles.
Any pointers appreciated!
Side-note: in case you're wondering why I am asking, have a look at this post. Even if I agree with most of the arguments, I am not too sure about the following:
We have programs that can improve
their code without human input now —
they’re called compilers.
... hence my question.
While it is true that compilers can improve code without human interference, the claim that "compilers are self-improving" is rather dubious. The "improvements" that compilers make are merely based on a set of rules written by humans (cyborgs anyone?). So the answer to your question is: No.
On a side note, if there was anything like a self improving compiler, we'd know... first the thing would improve the language, then its own code and finally, it would modify its code to become a virus and make all developers use it... and then finally we'd have one of those classic computer-versus-humans-last-hope-for-humanity kind of things... so ... No.
MilepostGCC is a machine-learning compiler which improves itself over time, in the sense that it is able to change itself in order to become "better". A simpler iterative compilation approach is able to improve pretty much any compiler.
25 years of programming and I have never heard of such a thing (unless you're talking about compilers that auto download software updates!).
Not yet practically implemented, to my knowledge, but yes, the theory is there:
Goedel machines: self-referential universal problem solvers making provably optimal self-improvements.
A self improving compiler would, by definition, have to have self modifying code. If you look around, you can find examples of people doing this (self modifying code). However, it's very uncommon to see - especially on projects as large and complex as a compiler. And it's uncommon for the very good reason that it's ridiculously hard (ie, close to impossible) to guarantee correct functionality. A lot of coders who think they're smart (especially Assembly coders) play around with this at one point or another. The ones who actually are smart mostly move out of this phase. ;)
In some situations, a C compiler is run several times without any human input, getting a "better" compiler each time.
Fortunately (or unfortunately, from another point of view) this process plateaus after a few steps -- further iterations generate exactly the same compiler executable as the last.
We have all the GCC source code, but the only C compiler on this machine available is not GCC.
Alas, parts of GCC use "extensions" that can only be built with GCC.
Fortunately, this machine does have a functional "make" executable, and some random proprietary C compiler.
The human goes to the directory with the GCC source, and manually types "make".
The make utility finds the MAKEFILE, which directs it to run the (proprietary) C compiler to compile GCC, and use the "-D" option so that all the parts of GCC that use "extensions" are #ifdef'ed out. (Those bits of code may be necessary to compile some programs, but not the next stage of GCC.) This produces a very limited, cut-down binary executable that barely has enough functionality to compile GCC (and the people who write GCC code carefully avoid using functionality that this cut-down binary does not support).
The make utility runs that cut-down binary executable with the appropriate option so that all the parts of GCC are compiled in, resulting in a fully-functional (but relatively slow) binary executable.
The make utility runs the fully-functional binary executable on the GCC source code with all the optimization options turned on, resulting in the actual GCC executable that people will use from now on, and installs it in the appropriate location.
The make utility tests to make sure everything is working OK: it runs the GCC executable from the standard location on the GCC source code with all the optimization options turned on. It then compares the resulting binary executable with the GCC executable in the standard location, and confirms that they are identical (with the possible exception of irrelevant timestamps).
After the human types "make", the whole process runs automatically, each stage generating an improved compiler (until it plateaus and generates an identical compiler).
http://gcc.gnu.org/wiki/Top-Level_Bootstrap and http://gcc.gnu.org/install/build.html and Compile GCC with Code Sourcery have a few more details.
I've seen other compilers that have many more stages in this process -- but they all require some human input after each stage or two. Example: "Bootstrapping a simple compiler from nothing" by Edmund Grimley Evans 2001
http://homepage.ntlworld.com/edmund.grimley-evans/bcompiler.html
And there is all the historical work done by the programmers who have worked on GCC, who use previous versions of GCC to compile and test their speculative ideas on possibly improved versions of GCC. While I wouldn't say this is without any human input, the trend seems to be that compilers do more and more "work" per human keystroke.
I'm not sure if it qualifies, but the Java HotSpot compiler improves code at runtime using statistics.
But at compile time? How will that compiler know what's deficient and what's not? What's the measure of goodness?
There are plenty of examples of people using genetic techniques to pit algorithms against each other to come up with "better" solutions. But these are usually well-understood problems that have a metric.
So what metrics could we apply at compile time? Minimum size of compiled code, cyclomatic complexity, or something else? Which of these is meaningful at runtime?
Well, there are JIT (just-in-time) techniques. One could argue that a compiler with some JIT optimizations might re-adjust itself to be more efficient with the program it is compiling?
What considerations do I need to make if I want my code to run correctly on both 32bit and 64bit platforms ?
EDIT: What kind of areas do I need to take care in, e.g. printing strings/characters or using structures ?
Options:
Code it in some language with a Virtual Machine (such as Java)
Code it in .NET and don't target any specific architecture. The .NET JIT compiler will compile it for you to the right architecture before running it.
One solution would be to target a virtual environment that runs on both platforms (I'm thinking Java, or .Net here).
Or pick an interpreted language.
Do you have other requirements, such as calling existing code or libraries?
The same things you should have been doing all along to ensure you write portable code :)
The Mozilla guidelines and the C FAQ are good starting points.
I assume you are still talking about compiling them separately for each individual platform? As running them on both is completely doable by just creating a 32bit binary.
The biggest one is making sure you don't put pointers into 32-bit storage locations.
But there's no proper 'language-agnostic' answer to this question, really. You couldn't even get a particularly firm answer if you restricted yourself to something like standard 'C' or 'C++' - the size of data storage, pointers, etc, is all terribly implementation dependent.
It honestly depends on the language. Managed languages like C# and Java, or scripting languages like JavaScript, Python, or PHP, hide most of the platform differences from you, so unless you are doing fairly advanced things there is not much to worry about.
But my guess is that you are asking about languages like C++, C, and other lower level languages.
The biggest thing you have to worry about is the size of things: in the 32-bit world you are limited to 2^32, whereas in the 64-bit world the limit grows to 2^64.
With 64-bit you have a larger space for memory and storage in RAM, and you can compute larger numbers. However if you know you are compiling for both 32 and 64, you need to make sure to limit your expectations of the system to the 32-bit world and limitations of buffers and numbers.
In C (and maybe C++) always remember to use the sizeof operator when calculating buffer sizes for malloc. This way you will write more portable code anyway, and this will automatically take 64bit datatypes into account.
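A minimal sketch of that advice, assuming a made-up `struct record` and a hypothetical helper `alloc_records`. The struct contains a pointer, so its size typically differs between 32-bit and 64-bit builds; writing `sizeof *p` instead of a hard-coded byte count lets the compiler supply the right size on each.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record type; it holds a pointer, so its size depends on
 * the platform's pointer width and padding rules. */
struct record {
    char *name;
    int   id;
};

/* Allocate an array of `count` records. sizeof *p follows the type of p,
 * so the request is sized correctly on any platform and stays correct if
 * the type of p changes later. Returns NULL on failure. */
struct record *alloc_records(size_t count)
{
    struct record *p;

    if (count > SIZE_MAX / sizeof *p)   /* the multiplication would overflow */
        return NULL;
    p = malloc(count * sizeof *p);
    return p;
}
```

The caller is expected to check the result for NULL and to release the array with `free`.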
In most cases the only thing you have to do is just compile your code for both platforms. (And that's assuming that you're using a compiled language; if it's not, then you probably don't need to worry about anything.)
The only thing I can think of that might cause problems is assuming the size of data types, which is something you probably shouldn't be doing anyway. And of course anything written in assembly is going to cause problems.
Keep in mind that many compilers choose the size of integer based on the underlying architecture, given that the "int" should be the fastest number manipulator in the system (according to some theories).
This is why so many programmers use typedefs for their most portable programs - if you want your code to work on everything from 8 bit processors up to 64 bit processors you need to recognize that, in C anyway, int is not rigidly defined.
Pointers are another area to be careful with - don't use a long, or long long, or any specific type if you are fiddling with the numeric value of the pointer - use the proper construct, which, unfortunately, varies from compiler to compiler (which is why you have a separate typedef.h file for each compiler you use).
-Adam Davis
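As one possible illustration of the points above about not assuming the width of `int` and not storing pointers in a fixed-size integer: with a C99 compiler, `<stdint.h>` and `<inttypes.h>` provide fixed-width types, matching `printf` macros, and `uintptr_t` (formally optional, but widely available). The helper names `align_up` and `format_values` are invented for this sketch, and older compilers may need the per-compiler typedef header described above instead.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Round a pointer up to the next multiple of `align` (assumed to be a
 * power of two). The numeric value goes through uintptr_t rather than
 * long, so it is not truncated where long is narrower than a pointer.
 * Converting the adjusted integer back to a pointer is
 * implementation-defined, though it works on common flat-memory targets. */
void *align_up(void *p, uintptr_t align)
{
    uintptr_t addr = (uintptr_t)p;
    addr = (addr + (align - 1)) & ~(align - 1);
    return (void *)addr;
}

/* Format a 64-bit value and a size without assuming the width of int,
 * long, or size_t: PRId64 expands to the right conversion for int64_t,
 * and %zu is the C99 conversion for size_t. */
int format_values(char *buf, size_t len, int64_t big, size_t count)
{
    return snprintf(buf, len, "%" PRId64 " %zu", big, count);
}
```

Using `%d` or `%ld` for these values instead may appear to work on one platform and print garbage on the other, which is the kind of printing problem the question asks about.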