Why and how is __attribute__ used in GNU C programs?
For what GCC and GCC-compatible compilers use __attribute__ for, most other compilers use #pragma directives.
I think GCC's solution is better: the required behavior for an unrecognised #pragma is to ignore it, whereas a compiler that does not understand an __attribute__ specification will fail to compile, which is generally better since you then know what you need to port.
Attribute specifications are used to specify aspects of types, data, and functions, such as storage and alignment, that cannot be specified in standard C. They are often target specific and mostly non-portable, certainly between compilers and often between targets. Avoid their use except where it is absolutely necessary for the correct functioning of the code.
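For instance, struct packing is spelled differently across compilers; a sketch (the struct names are made up for illustration):

/* GCC/Clang spelling of a packed struct: */
struct msg {
    char type;
    int  value;
} __attribute__((packed));   /* no padding between members */

/* The same request in compilers that use pragmas (e.g. MSVC): */
#pragma pack(push, 1)
struct msg2 {
    char type;
    int  value;
};
#pragma pack(pop)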
One use is for enforcing memory alignment on variables and structure members. For example:
float vect[4] __attribute__((aligned(16)));
This will ensure that vect is placed on a 16-byte memory boundary. I do not know if that is a gcc-ism or more generally applicable.
The compiler would typically align vect on only a 4-byte boundary. With 16-byte alignment it can be used directly with SIMD load instructions, where you'd load it into a 128-bit register that allows addition, subtraction, dot products, and all manner of vector operations.
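As an illustration of why that matters, here is a minimal sketch using x86 SSE intrinsics (add4 and other are invented for the example):

#include <xmmintrin.h>   /* SSE intrinsics */

float vect[4]  __attribute__((aligned(16)));
float other[4] __attribute__((aligned(16)));

void add4(void)
{
    /* _mm_load_ps requires a 16-byte-aligned address; the attribute
       above guarantees vect and other satisfy that. */
    __m128 a = _mm_load_ps(vect);
    __m128 b = _mm_load_ps(other);
    _mm_store_ps(vect, _mm_add_ps(a, b));   /* vect += other, elementwise */
}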
Sometimes you want alignment so that a structure can be directly overlaid onto memory-mapped hardware registers, or so that the hardware can write into it directly using a direct memory access (DMA) mechanism.
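A sketch of the register-overlay idea; the register layout, base address, and status bit are all hypothetical:

/* Hypothetical UART register block, invented for illustration. */
struct uart_regs {
    volatile unsigned int data;     /* offset 0x0 */
    volatile unsigned int status;   /* offset 0x4 */
    volatile unsigned int control;  /* offset 0x8 */
};

/* Overlay the struct on an assumed memory-mapped base address. */
#define UART0 ((volatile struct uart_regs *)0x4000C000u)

static inline void uart_send(unsigned char c)
{
    while (UART0->status & 1u)   /* assumed: bit 0 = TX busy */
        ;
    UART0->data = c;
}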
Why is it used in C programs? To limit their portability.
It begins with a double-underscore, so it's in the implementor's namespace - it's not something defined by the language standard, and each compiler vendor is free to use it for any purpose whatsoever.
Edit: Why is it used in GNU C programs? See the other answers that address this.
I'm curious as to why we are not allowed to use registers as offsets in MIPS. I know that you can't use registers as offsets like this: lw $t3, $t1($t4); I'm just curious as to why that is the case.
Is it a hardware restriction? Or simply just part of the ISA?
PS: if you're looking for what to do instead, see Load Word in MIPS, using register instead of immediate offset from another register or look at compiler output for a C function like int foo(int *arr, int idx){ return arr[idx]; } - https://godbolt.org/z/PhxG57ox1
I'm curious as to why we are not allowed to use registers as offsets in MIPS.
I'm not sure if you mean "why does MIPS assembly not permit you to write it in this form" or "why does the underlying ISA not offer this form".
If it's the former, then the answer is that the base ISA doesn't have any machine instructions that offer that functionality, and apparently the designers didn't decide to offer a pseudo-instruction that would implement it behind the scenes.2
If you're asking why the ISA doesn't offer it in the first place, it's just a design choice. By offering fewer or simpler addressing modes, you get the following advantages:
Less room is needed to encode a more limited set of possibilities, so you save encoding space for more opcodes, shorter instructions, etc.
The hardware can be simpler, or faster. For example, allowing two registers in address calculation may result in:
The need for an additional read port in the register file1.
Additional connections between the register file and the AGU to get both registers' values there.
The need to do a full-width (32- or 64-bit) addition rather than a simpler address + 16-bit addition for the offset.
The need for a three-input ALU if you still want to support immediate offsets with the 2-register addresses (and they are less useful if you don't).
Additional complexity in instruction decoding and address generation, since you may need to support two quite different paths for address generation.
Of course, all of those trade-offs may very well pay off in some contexts that could make good use of 2-reg addressing through smaller or faster code, but the original design, which was heavily inspired by the RISC philosophy, didn't include it. As Peter points out in the comments, new addressing modes have subsequently been added for some cases, although apparently not a general 2-reg addressing mode for load or store.
Is it a hardware restriction? Or simply just part of the ISA?
There's a bit of a false dichotomy there. It's certainly not a hardware restriction in the sense that hardware could support this, even when MIPS was designed. The phrasing seems to imply that some existing hardware had that restriction and the MIPS ISA somehow inherited it. I suspect it was much the other way around: the ISA was defined this way, based on analysis of how the hardware was likely to be implemented, and it then became a hardware simplification, since MIPS hardware doesn't need to support anything outside of what's in the MIPS ISA.
1 E.g., to support store instructions which would need to read from 3 registers.
2 It's certainly worth asking whether such a pseudo-instruction is a good idea or not: it would probably expand to an add of the two registers into a temporary register and then an lw with the result. There is always a danger that this hides "too much" work. Since it partly glosses over the difference between a true load that maps 1:1 to a hardware load and a version that does extra arithmetic behind the covers, it is easy to imagine it leading to sub-optimal decisions.
Take the classic example of linearly accessing two arrays of equal element size in a loop. With 2-reg addressing, it is natural to write this loop as two 2-reg accesses (each with a different base register and a common offset register). The only "overhead" for the offset maintenance is the single offset increment. This hides the fact that internally there are two hidden adds required to support the addressing mode: it would have simply been better to increment each base directly and not use the offset. Furthermore, once the overhead is clear, you can see that unrolling the loop and using immediate offsets can further reduce the overhead.
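In C terms, the cheap formulation increments each pointer directly rather than maintaining a shared offset; a sketch (the function name is illustrative):

/* Sum two arrays into a third. With base+16-bit-immediate addressing
   only, each access below is a plain load/store through a pointer that
   is bumped once per iteration -- no hidden base+index adds. */
void add_arrays(const int *a, const int *b, int *out, int n)
{
    const int *end = a + n;
    while (a < end)
        *out++ = *a++ + *b++;
}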
I have seen some people suggest that using signbit() can eliminate warp divergence and improve performance. If this is correct, how is it implemented on the GPU? Is there dedicated hardware for this function in, e.g., the special function units (SFU)?
The implementation of signbit() is in the open in CUDA versions up to, and including, CUDA 6.5. It can be found in the header file math_functions.h. For newer versions of CUDA, you could inspect the machine code with cuobjdump --dump-sass to see how it is implemented.
Looking at the header file in CUDA 6.5, one sees that signbit() is a macro that maps to an inline function which extracts the sign bit from the raw bit representation of the floating-point operand. On GPUs this is easy, since integer and floating-point operands share the same register file. In the case of CUDA 6.5, the sign bit is extracted with a single right-shift instruction.
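A minimal sketch of that idea in CUDA C++ (my_signbit is a made-up name, not the header's actual implementation):

__device__ int my_signbit(float x)
{
    /* __float_as_int reinterprets the float's bits as a 32-bit int
       (free on GPUs, since both live in the same register file);
       shifting the sign bit down and masking yields 0 or 1. */
    return (__float_as_int(x) >> 31) & 1;
}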
So the implementation of signbit() is branchless and efficient; however, there is no dedicated hardware instruction for it, as none is necessary.
In general, CUDA programmers do not need to worry about branches all that often, especially where if-then-else constructs with small bodies are concerned. The compiler frequently renders these into branchless code using either predication or select-type instructions (the machine equivalent of the C/C++ ternary operator). It may also combine uniform branches with predication.
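For example, a small conditional like the following typically compiles to a select or predicated instruction rather than an actual branch (a sketch):

__device__ float relu(float x)
{
    /* Usually becomes a single select instruction, so all threads
       in the warp execute the same instruction stream. */
    return (x > 0.0f) ? x : 0.0f;
}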
I have a large program that uses all the registers I allocated per thread (64) and spills to local memory. I would like to be able to tell the compiler which variables should remain in registers at all cost, and which ones I don't really care about. Does the "register" C/C++ keyword work in nvcc? Is there a different mechanism perhaps?
Thanks!
You can use register in CUDA C/C++ if you want to. In any context, it is only a hint to the compiler. It may be ignored. There is no stated guarantee that it does anything at all.
I think these statements are pretty much true for most language implementations of register.
I also think it's quite likely that the compiler can do a better job than you can of deciding what should be in registers, and with what priority.
The typical CUDA C/C++ mechanisms for controlling register usage work at a higher level; they are:
the -maxrregcount compile switch
the __launch_bounds__() qualifier (see the sketch below).
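A sketch of both mechanisms; the kernel and the numbers are arbitrary:

// On the command line: nvcc -maxrregcount=64 kernel.cu

// Or per kernel: tell the compiler to target up to 256 threads per
// block with at least 4 resident blocks per multiprocessor, which
// bounds how many registers each thread may use.
__global__ void __launch_bounds__(256, 4) my_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;   // placeholder body
}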
I have code written in old-style Fortran 95 for combustion modelling. One of the features of this problem is that one has to solve a stiff ODE system to take into account the influence of chemical reactions. For this purpose I use the Fortran SLATEC library, which is also quite old. The solving procedure is straightforward: one just needs to call the subroutine ddriv3 in every cell of the computational domain, so it looks something like this:
do i = 1,Number_of_cells ! Number of cells is about 2000
call ddriv3(...) ! All calls are independent on cell number i
end do
ddriv3 is quite complex and utilizes many other library functions.
Is there any way to get an advantage with CUDA Fortran, without searching for another library for this purpose? If I just run this as a "parallel loop", will that be efficient, or maybe there is another way?
I'm sorry for the kind of question that immediately invites the most obvious answer, "Why don't you try it and find out for yourself?", but I'm under really tight time constraints. I have no experience with CUDA, and I just want to choose the most sensible and easiest way to start.
Thanks in advance !
You won't be able to use or parallelize the ddriv3 call without some effort. Your usage of the phrase "parallel loop" suggests to me you may be thinking of using OpenACC directives with Fortran, as opposed to CUDA Fortran, but the general answer isn't any different in either case.
The ddriv3 call, being part of a Fortran library (which is presumably compiled for x86 usage) cannot be directly used in either CUDA Fortran (i.e. using CUDA GPU kernels within Fortran) or in OpenACC Fortran, for essentially the same reason: The library code is x86 code and cannot be used on the GPU.
Since presumably you may have access to the source implementation of ddriv3, you might be able to extract the source code, and work on creating a CUDA version of it (or a version that OpenACC won't choke on), but if it uses many other library routines, it may mean that you have to create CUDA (or direct Fortran source, for OpenACC) versions of each of those library calls as well. If you have no experience with CUDA, this might not be what you want to do (I don't know.) If you go down this path, it would certainly imply learning more about CUDA, or at least converting the library calls to direct Fortran source (for an OpenACC version).
For the above reasons, it might make sense to investigate whether a GPU library replacement (or something similar) might exist for the ddriv3 call (but you specifically excluded that option in your question.) There are certainly GPU libraries that can assist in solving ODE's.
Is it possible to "link" the STL to an assembly program, e.g. similarly to linking against glibc to use functions like strlen, etc.? Specifically, I want to write an assembly function which takes as an argument a std::vector and will be part of a lib. If this is possible, is there any documentation on this?
Any use of C++ templates will require the compiler to generate instantiations of those templates. So you don't really "link" something like the STL into a program; the compiler generates object code based upon your use of templates in the library.
However, if you can write some C++ code that forces the templates to be instantiated for whatever types and other arguments you need to use, then write some C-linkage functions to wrap the uses of those template instantiations, then you should be able to call those from your assembly code.
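For example, a hedged sketch of such wrappers (the names are invented):

#include <cstddef>
#include <vector>

// Force the instantiation we need and expose it through C-linkage
// functions that assembly code can call by (unmangled) name.
extern "C" std::size_t intvec_size(const std::vector<int> *v)
{
    return v->size();
}

extern "C" int *intvec_data(std::vector<int> *v)
{
    return v->empty() ? 0 : &(*v)[0];
}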
I strongly believe you're doing it wrong. Using assembler is not going to speed up your handling of the data. If you must use existing assembly code, simply pass raw buffers.
std::vector is by definition (in the standard) compatible with raw buffers (arrays); the standard mandates contiguous allocation. Only reallocation can invalidate the memory region that contains the element data. In short, if the C++ code can know the (max) capacity required and reserve()/resize() appropriately, you can pass &vector[0] as the buffer address and be perfectly happy.
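A sketch of that pattern, assuming a C-linkage assembly routine asm_process (the name is made up):

#include <cstddef>
#include <vector>

extern "C" void asm_process(float *buf, std::size_t n);  // hypothetical asm routine

void run()
{
    std::vector<float> v(1024);      // contiguous storage, sized up front
    asm_process(&v[0], v.size());    // hand the raw buffer to assembly
}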
If the assembly code needs to decide how much to (re)allocate, let it use malloc. Once done, you should be able to use that array with STL algorithms:
std::accumulate(buf, buf+n, 0, &dosomething);
Alternatively, you can use the fact that std::tr1::array<T, n> or boost::array<T, n> are POD, and use placement new right on the buffer allocated in the library (see here: placement new + array + alignment or How to make tr1::array allocate aligned memory?)
Side note
I have the suspicion that you are using assembly for the wrong reasons. Optimizing compilers will leverage the full potential of modern processors (including SIMD such as SSE1-4);
E.g. for gcc, have a look at:
__attribute__ (e.g. for pointer restrictions such as alignment and aliasing guarantees; this enables the more powerful vectorization options in the compiler)
-ftree-vectorize and -ftree-vectorizer-verbose=2, -march=native
Note also that since the compiler can't be sure what registers an external (or even inline) assembly procedure clobbers, it must assume all registers are clobbered leading to potential performance degradation. See http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html for ways to use inline assembly with proper hints to gcc.
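A small sketch of extended asm with explicit constraints and a clobber list (x86, GCC syntax):

unsigned int rdtsc_low(void)
{
    unsigned int lo;
    /* Declaring "=a" as output and "edx" as clobbered tells gcc
       exactly which registers this instruction touches, so it does
       not have to assume everything is clobbered. */
    __asm__ volatile ("rdtsc" : "=a"(lo) : : "edx");
    return lo;
}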
probably completely off-topic: -fopenmp and gnu::parallel
Bonus: the following references on (premature) optimization in assembly and c++ might come in handy:
Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms
Optimizing subroutines in assembly language: An optimization guide for x86 platforms
The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
And some other relevant resources