Creating Universal binaries for OpenCL Kernel for Intel GPU

We write OpenCL C code, create and build the program with clCreateProgramWithSource, and use clGetProgramInfo to get the binary. This binary is then integrated into the product binary, which calls clCreateProgramWithBinary during initialization.
We create a .h file and include it in the source file. The .h file contains the binary generated by compiling the OpenCL C kernel.
The issue with this approach is that the binary's compatibility is expected to break with any minor or major change in OpenCL, and it will most likely break across vendors. We would need to generate the OpenCL kernel binary for each vendor and each OpenCL release.
It is possible to integrate the OpenCL kernel binary into the project in header form. In that case, however, if the binary is incompatible, we are not in a position to replace it, and project initialization fails.
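As an illustration of the header-embedding step described above, here is a minimal sketch of a build-time script that turns a kernel binary into a C header. The function name, the array name `kernel_binary`, and the header layout are assumptions made for this example; the actual build tooling is not described in the question.

```python
def binary_to_c_header(data: bytes, array_name: str = "kernel_binary",
                       per_line: int = 12) -> str:
    """Render raw bytes as a C header with an unsigned char array and its size.

    The array name and layout are hypothetical; adapt them to the project.
    """
    if not data:
        raise ValueError("refusing to emit an empty array (invalid in C)")
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        # One row of hex literals, e.g. "0x7f, 0x45, 0x4c,"
        lines.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        "#pragma once\n"
        "#include <stddef.h>\n\n"
        f"static const unsigned char {array_name}[] = {{\n{body}\n}};\n"
        f"static const size_t {array_name}_size = {len(data)};\n"
    )


if __name__ == "__main__":
    # Stand-in bytes; in practice these would be the bits returned by
    # clGetProgramInfo(..., CL_PROGRAM_BINARIES, ...).
    print(binary_to_c_header(b"\x00\x01\x02\xff", "kernel_binary"))
```

The host code would then pass `kernel_binary` and `kernel_binary_size` to clCreateProgramWithBinary, which is exactly the point where the compatibility problem described above shows up.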
Expected Solution
The OpenCL C source is proprietary to the company and cannot be shared with the customers.
Since the OpenCL kernel binary is integrated into the project library, we need to understand whether it is possible to generate a binary that can adapt itself to the target platform during clCreateProgramWithBinary.
If it is absolutely necessary to generate the binary once for each vendor and each OpenCL minor/major revision and store it on disk (which would be done on the end user's machine), how can we protect the source, which is proprietary to the company? Is SPIR the only option?
I have already read "Universal binaries for OpenCL", but it suggests that SPIR also takes a long time to compile, so it might not be the solution I am looking for, since initialization time is also important.

In practice, the Intel Gen binary format can change between driver versions for the same platform and hardware (e.g. for bug-fix workarounds and performance improvements). Hence, the bits returned by clGetProgramInfo are only sure to work in clCreateProgramWithBinary on the same combination of device, driver, and so on. Sadly, this means that the binary path is a poor match for the intellectual property security problem.
SPIR sort of splits the difference as it would be hardware independent while still being harder to reverse engineer. If startup performance is somehow important, you can always try the clCreateProgramWithBinary path; just be able to fall back to SPIR should the binary load fail (meaning the driver changed or something).
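The "try the binary, fall back to SPIR" strategy can be sketched as a small cache keyed on device and driver version. This is only a model of the control flow: `load_binary` and `build_portable` are hypothetical callbacks standing in for the clCreateProgramWithBinary path and the SPIR build path (followed by clGetProgramInfo to fetch fresh bits), and no real OpenCL calls are made here.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str]  # (device name, driver version)


class ProgramCache:
    """Cache device binaries per (device, driver); rebuild when loading fails."""

    def __init__(self) -> None:
        self._binaries: Dict[CacheKey, bytes] = {}

    def load(self, device: str, driver: str,
             load_binary: Callable[[bytes], object],
             build_portable: Callable[[], Tuple[object, bytes]]):
        key = (device, driver)
        cached = self._binaries.get(key)
        if cached is not None:
            try:
                # Fast path: hypothetical wrapper around clCreateProgramWithBinary.
                return load_binary(cached), "binary"
            except RuntimeError:
                # The driver rejected the bits: drop them and rebuild below.
                del self._binaries[key]
        # Slow path: hypothetical wrapper that builds from SPIR and
        # returns the program plus the freshly generated device binary.
        program, binary = build_portable()
        self._binaries[key] = binary
        return program, "rebuilt"
```

In a real application the cache would live on disk on the end user's machine, so the slow path is only paid on first run and after a driver update.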

Related

What is real difference between Firmware and Embedded Software

I am looking for the real difference between firmware and embedded software.
Sources on the internet say that firmware is a type of embedded software, but not vice versa. In addition, the classic BIOS example is very old.
They both run in non-volatile memory. One difference is that embedded software is like application programming: it has an RTOS and a file system, and it can be run from RAM.
If I don't use an RTOS or RAM and only use flash memory, does that mean my embedded software is firmware?
What actually makes the real difference, is it the memory layout?
The answers on the internet lack technical explanation and are not satisfying.
Thank you very much.

They are not distinctly separate things, or even well defined. Firmware is a subset of software; the term typically implies that it is in read-only memory:
Software refers to any machine executable code - including "firmware".
Firmware refers to software in read-only memory
Read-only memory in this context includes re-writable memory such as flash or EPROM that requires a specific erase/write operation and is not simply random-access writable.
The distinction between RAM and ROM execution is not really a distinction between firmware and software. Many embedded systems load executable code from ROM and execute from RAM for performance reasons, while others execute directly from ROM. Rather if the end-user cannot easily modify or replace the software without special tools or a bootloader, then it might be regarded as "firm". If on the other hand a normal end-user can modify, update or replace the software using facilities on the system itself (by copying a file from removable media or network for example), then it is not firmware. Consider the difference in operation for example in updating your PC's BIOS and updating Microsoft Office - the former requires a special procedure distinct from normal operating system services for loading and running software.
For example, the operating system, bootloader and BIOS of a smart phone might be considered firmware. The apps a user loads from an app-store are certainly not firmware.
In other contexts "firmware" might refer to the configuration of a programmable logic device such as an FPGA, as opposed to sequentially executed processor instructions. That is rather a niche distinction, but a useful one in systems employing both programmable logic and software execution.
Ultimately you would use the term "firmware" to imply some level of "permanence" of software in a system, but there is a spectrum, so you would use the term in whatever manner is useful in the context of your particular system. For example, I am working on a system where all the code runs from flash, so I only ever use the term software to refer to it, because there is no need to distinguish it from any other kind of software in the system.

How is WebGL or CUDA code actually translated into GPU instructions?

When you write shaders and such in WebGL or CUDA, how is that code actually translated into GPU instructions?
I want to learn how you can write super low-level code that optimizes graphic rendering to the extreme, in order to see exactly how GPU instructions are executed, at the hardware/software boundary.
I understand that, for CUDA for example, you buy their graphics card (GPU), which is somehow implemented to optimize graphics operations. But then how do you program on top of that (in a general sense), without C?
The reason for this question is because on a previous question, I got the sense that you can't program the GPU directly by using assembly, so I am a bit confused.
If you look at docs like CUDA by example, that's all just C code (though they do have things like cudaMalloc and cudaFree, which I don't know what that's doing behind the scenes). But under the hood, that C must be being compiled to assembly or at least machine code or something, right? And if so, how is that accessing the GPU?
Basically I am not seeing how, at a level below C or GLSL, the GPU itself is being instructed to perform operations. Can you please explain? Is there some snippet of assembly that demonstrates how it works, or anything like that? Or is there another set of some sort of "GPU registers" in addition to the 16 "CPU registers" on x86 for example?

The GPU driver compiles it to something the GPU understands, which is something else entirely than x86 machine code. For example, here's a snippet of AMD R600 assembly code:
    00 ALU: ADDR(32) CNT(4) KCACHE0(CB0:0-15)
          0  x: MUL R0.x, KC0[0].x, KC0[1].x
             y: MUL R0.y, KC0[0].y, KC0[1].y
          1  z: MUL R0.z, KC0[0].z, KC0[1].z
             w: MUL R0.w, KC0[0].w, KC0[1].w
    01 EXP_DONE: PIX0, R0
    END_OF_PROGRAM
The machine code version of that would be executed by the GPU. The driver orchestrates the transfer of the code to the GPU and instructs it to run it. That is all very device specific, and in the case of NVIDIA, undocumented (at least, not officially documented).
The R0 in that snippet is a register, but on GPUs registers usually work a bit differently. They exist "per thread", and are in a way a shared resource (in the sense that using many registers in a thread means that fewer threads will be active at the same time). In order to have many threads active at once (which is how GPUs tolerate memory latency, whereas CPUs use out of order execution and big caches), GPUs usually have tens of thousands of registers.
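The trade-off between registers per thread and the number of threads that can be active at once can be made concrete with a little arithmetic. The sketch below is only an illustration: the register file size and the thread limit are hypothetical round numbers, not figures for any particular GPU, and real hardware also allocates registers in coarser granules.

```python
def max_resident_threads(register_file_size: int,
                         registers_per_thread: int,
                         hardware_thread_limit: int) -> int:
    """Upper bound on simultaneously active threads imposed by register use."""
    if registers_per_thread <= 0:
        raise ValueError("registers_per_thread must be positive")
    # The shared register file is divided among threads; a separate
    # hardware cap applies even if registers are plentiful.
    by_registers = register_file_size // registers_per_thread
    return min(hardware_thread_limit, by_registers)


if __name__ == "__main__":
    # Hypothetical compute unit: 65536 registers, at most 2048 threads.
    for regs in (32, 64, 128):
        print(regs, "registers/thread ->",
              max_resident_threads(65536, regs, 2048), "threads")
```

Doubling the registers a thread uses halves the number of threads available to hide memory latency, which is why GPU compilers treat register count as a tuning parameter.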

Those languages are translated to machine code by a compiler. That compiler is just part of the drivers/runtimes of the various APIs, and is totally implementation specific. There are no families of common instruction sets like the ones we are used to in CPU land (x86, ARM or whatever). Different GPUs all have their own incompatible instruction sets. Furthermore, there are no APIs with which to upload and run arbitrary binaries on those GPUs. And there is little publicly available documentation for that, depending on the vendor.
The reason for this question is because on a previous question, I got the sense that you can't program the GPU directly by using assembly, so I am a bit confused.
Well, you can. In theory, at least. If you do not care about the fact that your code will only work on a small family of ASICs, and if you have all the necessary documentation for that, and if you are willing to implement some interface to the GPU that allows you to run those binaries, you can do it. If you want to go that route, you could look at the Mesa3D project, as it provides open source drivers for a number of GPUs, including an LLVM-based compiler infrastructure to generate the binaries for the particular architecture.
In practice, there is no useful way of bare metal GPU programming on a large scale.

differences between virtual and real architecture of cuda

I am trying to understand the differences between the virtual and real architectures of CUDA, and how different configurations affect the performance of the program, e.g.
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
...
The following explanation was given in NVCC manual,
GPU compilation is performed via an intermediate representation, PTX
([...]), which can
be considered as assembly for a virtual GPU architecture. Contrary to an actual graphics
processor, such a virtual GPU is defined entirely by the set of capabilities, or features,
that it provides to the application. In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.
Hence, a nvcc compilation command always uses two architectures: a compute
architecture to specify the virtual intermediate architecture, plus a real GPU architecture
to specify the intended processor to execute on. For such an nvcc command to be valid,
the real architecture must be an implementation (someway or another) of the virtual
architecture. This is further explained below.
The chosen virtual architecture is more of a statement on the GPU capabilities that
the application requires: using a smallest virtual architecture still allows a widest range
of actual architectures for the second nvcc stage. Conversely, specifying a virtual
architecture that provides features unused by the application unnecessarily restricts the
set of possible GPUs that can be specified in the second nvcc stage.
But I still don't quite get how performance is affected by the different configurations (or do they maybe only affect the selection of the physical GPU devices?). In particular, this statement is the most confusing to me:
In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.

The NVIDIA CUDA Compiler Driver NVCC User Guide Section on GPU Compilation provides a very thorough description of virtual and physical architecture and how the concepts are used in the build process.
The virtual architecture specifies the feature set that is targeted by the code. The table listed below shows some of the evolution of the virtual architecture. When compiling you should specify the lowest virtual architecture that has a sufficient feature set to enable the program to be executed on the widest range of physical architectures.
Virtual Architecture Feature List (from the User Guide)
    compute_10    Basic features
    compute_11    + atomic memory operations on global memory
    compute_12    + atomic memory operations on shared memory
                  + vote instructions
    compute_13    + double precision floating point support
    compute_20    + Fermi support
    compute_30    + Kepler support
The physical architecture specifies the implementation of the GPU. This provides the compiler with the instruction set, instruction latency, instruction throughput, resource sizes, etc. so that the compiler can optimally translate the virtual architecture to binary code.
It is possible to specify multiple virtual and physical architecture pairs to the compiler and have the compiler pack the resulting PTX and binary code into a single fat binary. At runtime the CUDA driver will choose the best representation for the physical device that is installed. If no suitable binary code is provided in the fatbinary file, the driver can JIT-compile the best available PTX for that device.
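The runtime choice described above can be modelled in a few lines. This is a simplified sketch, not the driver's actual algorithm: it assumes that device binary code is usable only within the same major compute capability and at an equal or lower minor version, and that PTX can be JIT-compiled for any device whose compute capability is at least the PTX version. The function name and return values are made up for this example.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, int]  # compute capability as (major, minor)


def choose_image(device: Version, sass: List[Version],
                 ptx: List[Version]) -> Optional[Tuple[str, Version]]:
    """Pick a code image from a fat binary for the given device (simplified)."""
    # Prefer precompiled device code: same major version, minor not newer.
    usable_sass = [v for v in sass if v[0] == device[0] and v[1] <= device[1]]
    if usable_sass:
        return ("sass", max(usable_sass))
    # Otherwise JIT-compile the newest PTX the device can support.
    usable_ptx = [v for v in ptx if v <= device]
    if usable_ptx:
        return ("ptx-jit", max(usable_ptx))
    return None  # nothing in the fat binary can run on this device


if __name__ == "__main__":
    # Roughly "-gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21"
    # plus embedded compute_20 PTX.
    print(choose_image((2, 1), sass=[(2, 0), (2, 1)], ptx=[(2, 0)]))
    print(choose_image((3, 0), sass=[(2, 0), (2, 1)], ptx=[(2, 0)]))
```

Under this model, the second call falls back to JIT compilation, which is where the extra start-up cost of shipping only PTX comes from.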

"Virtual architecture" code will get compiled by a just-in-time compiler before being loaded on the device. AFAIK, it is the same compiler as the one NVCC invokes when building "physical architecture" code offline - so I don't know if there will be any differences in the resulting application performance.
Basically, every generation of the CUDA hardware is binary incompatible with previous generation - imagine next generation of Intel processors sporting ARM instruction set. This way, virtual architectures provide an intermediate representation of the CUDA application that can be compiled for compatible hardware. Every hardware generation introduces new features (e.g. atomics, CUDA Dynamic Parallelism) that require new instructions - that's why you need new virtual architectures.
Basically, if you want to use CDP you should compile for SM 3.5. You can compile it to device binary that will have assembly code for specific CUDA device generation or you can compile it to PTX code that can be compiled into device assembly for any device generation that provides these features.

The virtual architecture specifies what capabilities a GPU has and the real architecture specifies how it does it.
I can't think of any specific examples off hand. A (probably not correct) example may be a virtual GPU specifying the number of cores a card has, so code is generated targeting that number of cores, whereas the real card may have a few more for redundancy (or a few less due to manufacturing errors) and some methods of mapping to the cores that are actually in use, which can be placed on top of the more generic code generated in the first step.
You can think of the PTX code sort of like assembly code, which targets a certain architecture, which can then be compiled to machine code for a specific processor. Targeting the assembly code for the right kind of processor will, in general, generate better machine code.

Well, what NVIDIA writes in its documentation usually leaves people (including me) more confused! (Maybe it's just me.)
You are concerned about performance. Basically, what the manual says is that you (probably) don't need to be, but you should be. GPU architecture is a bit like nature: they run something on it and something happens, then they try to explain it, and then they feed that explanation to you.
In the end, you should probably run some tests and see which configuration gives the best result.
The virtual architecture is designed to let you think freely. Go along with that: use as many threads as you want. You can choose virtually any number of threads and blocks; it will be translated to PTX and the device will run it.
The only problem is that if you assign more than 1024 threads to a single block, the kernel will not run and you will get 0s as the result, because the device (the real architecture) doesn't support it.
Or, for example, if your device supports compute capability 1.2, you can declare double-precision variables in your code, but again you will get 0s as the result, simply because the device can't run it.
Performance-wise, you have to know that the 32 threads of a warp should access neighboring positions in memory, or else your accesses will be serialized, and so on.
So I hope you've got the point by now. It is a relatively new science, and a GPU is a really sophisticated piece of hardware. Everybody is trying to make the best of it, but it's a game of testing plus a little knowledge of the actual architecture behind CUDA. I suggest that you read up on GPU architecture and see how the virtual threads and thread blocks are actually implemented.

What is ABI(Application Binary Interface)?

This is what wikipedia says:
In computer software, an application binary interface (ABI) describes the low-level interface between an application (or any type of) program and the operating system or another application.
ABIs cover details such as data type, size, and alignment; the calling convention, which controls how functions' arguments are passed and return values retrieved; the system call numbers and how an application should make system calls to the operating system; and in the case of a complete operating system ABI, the binary format of object files, program libraries and so on. A complete ABI, such as the Intel Binary Compatibility Standard (iBCS), allows a program from one operating system supporting that ABI to run without modifications on any other such system, provided that necessary shared libraries are present, and similar prerequisites are fulfilled.
I guess that an ABI is a convention or standard, and compilers/linkers use this convention to produce object code. Is that right? If so, who made these conventions (companies or some organization)? What was it like when there were no ABIs? Are there documents about these ABIs that we can refer to?

You're correct about the definition of an ABI, up to a point. The classic example is the syscall interface in Linux (and other UNIXes).
They are a standard way for code to request the operating system to carry out certain duties.
As such, they're decided by the people that wrote the OS or, in the case where the syscalls have been added later, by whoever added them (in cases where the OS allows this). For example, the Linux syscall interface on x86 states that you load the syscall number into eax, with other parameters placed in ebx, ecx and so on, depending on the syscall you're making (eax).
Typically, it's not the compiler or linker which do the work of interfacing, rather it's the libraries provided for the language you're using.
Returning to Linux, the GNU C library contains code for fopen (for example), which eventually calls the relevant syscall to perform the lower-level tasks (syscall number 5, open). A list of the syscalls can be found in this PDF file.
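To see the library-versus-syscall layering in action, the following sketch asks the C library's generic `syscall` wrapper to issue a raw getpid and compares the result with the normal library route. It is an illustration under assumptions: it only runs the raw path on 64-bit Linux on x86-64 or AArch64, and the syscall numbers in the table are specific to those ABIs (they differ from the 32-bit x86 numbering used in the answer above).

```python
import ctypes
import os
import platform
from typing import Optional

# Syscall numbers are part of the ABI and differ per architecture.
_SYS_GETPID = {"x86_64": 39, "aarch64": 172}


def raw_getpid() -> Optional[int]:
    """Issue getpid via libc's syscall(); return None on unsupported platforms."""
    if platform.system() != "Linux" or ctypes.sizeof(ctypes.c_void_p) != 8:
        return None
    number = _SYS_GETPID.get(platform.machine())
    if number is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)  # the already-loaded C library
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(number))


if __name__ == "__main__":
    # os.getpid() goes through the usual library wrapper; both should agree.
    print("library wrapper:", os.getpid(), "raw syscall:", raw_getpid())
```

Application code almost never needs the raw path; the point is only that the wrapper and the numbered kernel entry point are two views of the same interface.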

Specification is a more suitable term than convention, as a convention is a loose term for a widely accepted practice, whereas a specification is well defined.
You are right. The specification is made by a standardization body. Take a look at the POSIX specification, which is supported by Windows; compiler and build toolchains such as gcc assume that operating systems adhere to it, and even the Linux kernel partially (almost exactly) adheres to it.
Before ABIs? Even today, firmware is hand-crafted as new chips come along for set-top boxes and other devices with embedded systems.
For assembly language, the documentation is the digital logic content in the data sheet of the chip being programmed; for higher-level languages, the cross-compiler toolchain documentation gives away the assumptions that should be part of the ABI.

Well, the concept of an ABI was presumably conceived to support binary compatibility of your program across other operating systems and machine architectures. So, let's suppose that you wrote a program on some operating system distribution running on the x86 architecture. For a programmer, the most important thing is that this program should run exactly the same on any other machine, whether it uses the same or a different architecture; let's say, for the sake of discussion, that the other machine runs on the i386 architecture. This is where the concept of the ABI, or Application Binary Interface, comes in.
Every machine architecture defines its own way in which the operating system kernel talks to the outside world, i.e. user-space programs. Hence every architecture defines a different set of system calls and machine registers, how those registers are used, how software interrupts are handled by the kernel, and so on. The ABI is what pins down these details, which affect things like compiling, linking, and byte ordering. System programmers have had a hard time defining a uniform ABI for the same operating system running on different architectures, which is why every machine architecture has its own, and you need to compile your programs so that they conform to the format those machines expect.

What does executable file actually contain?

What does an executable actually contain? Does it contain instructions to the processor in the form of opcodes and operands? If so, why do we have different executables for different operating systems?

Processors understand programs in terms of opcodes, so your intuition about executables containing opcodes is correct, and you guessed correctly that any executable has to have opcodes and operands for executing the program on a processor.
However, programs mostly execute with the help of operating systems (you can write programs which do not use an OS to execute, but that would be a lot of unnecessary work) - which provide abstractions on top of the hardware which the programs can use. The OS is responsible for setting up a "context" for any program to run i.e. provide the program the memory it needs, provide general purpose libraries which the program can use for doing common stuff such as write to files, print to console etc.
However, to set up the context for the program (provide it memory, load its data, set up a stack for it), the OS needs to read a program's executable file and needs to know a few things about the program such as the data which the program expects to use, size of that data, the initial values stored in that data region, the list of opcodes that make up the program (also called the text region of a process), their size etc. All of this data and a lot more (debugging information, readonly data such as hardcoded strings in the program, symbol tables etc) is stored within the executable file. Each OS understands a different format of this executable file, since they expect all this info to be stored in the executable in different ways. Check out the links provided by Groo.
A couple of formats that have been used for storing information in an executable file are ELF and COFF on UNIX systems and PE on Windows.
P.S. - Not all programs need executable formats. Look up bootloaders on Google. These are special programs which occupy the first sector of a bootable partition on the hard-disk and are used to load the OS itself.
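The fact that each OS expects its own layout is visible in the first few bytes of any executable. The sketch below is a deliberately tiny, hedged example: it recognizes only the ELF magic and the "MZ" DOS header with which PE files on Windows begin, and decodes just the ELF word size and target machine. The function name and the small machine table are choices made for this illustration.

```python
import struct
from typing import Dict

# A few e_machine values from the ELF specification.
ELF_MACHINES = {0x03: "x86", 0x28: "ARM", 0x3E: "x86-64", 0xB7: "AArch64"}


def describe_executable(header: bytes) -> Dict[str, str]:
    """Classify an executable from its leading bytes (at least 20 for ELF)."""
    if header[:4] == b"\x7fELF" and len(header) >= 20:
        bits = {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")
        little_endian = header[5] != 2  # EI_DATA: 1 = little, 2 = big
        fmt = "<H" if little_endian else ">H"
        (machine,) = struct.unpack_from(fmt, header, 18)  # e_machine field
        return {"format": "ELF", "class": bits,
                "machine": ELF_MACHINES.get(machine, hex(machine))}
    if header[:2] == b"MZ":
        return {"format": "PE/MZ"}
    return {"format": "unknown"}


if __name__ == "__main__":
    import sys
    # Inspect the running Python interpreter's own executable.
    with open(sys.executable, "rb") as f:
        print(describe_executable(f.read(64)))
```

Note that the header records both the container format (which the OS loader cares about) and the machine type (which the processor cares about), matching the two reasons executables are not portable.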

Yes, code in the form of opcodes and operands, and data of course. Anything you want to do that involves the operating system in any way depends on the operating system, not on the CPU. That is why you need different programs for different operating systems. Opening a window in Windows is not done with the same sequence of instructions as in Linux, and so on.

As unwind implied in his answer, an executable file contains calls to routines in the Operating System.
It would be extremely inefficient for an executable file to try to implement functions already provided by the OS (for example, writing to disk, accepting input) so heavy use is made of calls to the OS functions.
Different Operating Systems provide functions which do similar things, but the details of how to call those functions (and where they are) may be different.
So, apart from the major differences of processor type, executables written for one OS won't work with another.

To do any form of I/O, an executable needs to interface with the operating system using system calls. On Windows these are calls to the Win32 API, and on Linux/Unix these are mostly POSIX calls.
Furthermore, the executable file format differs between operating systems in the same way a PNG file differs from a GIF file. The data is ordered differently, and there are different headers and sub-headers.

An executable file contains several blobs of data and instructions on how that data should be loaded into memory. Some of these sections happen to contain machine code that can be executed. Other sections contain program data, resources, relocation information, import information etc.