Understanding foreign function interface (FFI) and language binding (language-agnostic)

Mixing different programming languages has long been something I don't quite understand. According to this Wikipedia article, implementing a foreign function interface (FFI) can be done in several ways:
Requiring that guest-language functions which are to be host-language callable be specified or implemented in a particular way; often using a compatibility library of some sort.
Use of a tool to automatically "wrap" guest-language functions with appropriate glue code, which performs any necessary translation.
Use of wrapper libraries
Restricting the set of host language capabilities which can be used cross-language. For example, C++ functions called from C may not (in general) include reference parameters or throw exceptions.
My questions:
What are the differences between the 1st, 2nd and 3rd ways? It seems to me that they all compile the code of the called language into some library with object files and header files, which are then called by the calling language.
One source it links to says that implementing an FFI can be done in several ways:
Requiring the called functions in the target language to implement a specific protocol.
Implementing a wrapper library that takes a given low-level-language function and "wraps" it with code to do data conversion to/from the high-level language's conventions.
Requiring functions declared native to use a subset of high-level functionality (which is compatible with the low-level language).
I was wondering: is the first way in the linked source the same as the first way in Wikipedia? What does the third way in this source mean? Does it correspond to the 4th way in Wikipedia?
In the same source, when comparing the three ways it lists, it seems to say that the job of filling the gap between the two languages is gradually shifted from the called language to the calling language. I was wondering how to understand that? Is this shifting also true for the four ways in Wikipedia?
Are language binding and FFI equivalent concepts? How are they related, and how do they differ?
a binding from a programming language to a library or OS service is an API providing that service in the language.
I was wondering which way, in the quotation from Wikipedia or from the source, each of the following examples belongs to:
Common Object Request Broker Architecture (CORBA)
Calling C in C++, by the extern "C" declaration in C++ to disable name mangling.
Calling C in Matlab, by the MATLAB Interface to Shared Libraries, i.e., first compiling C code to a shared library via a general C compiler such as gcc, and then loading, calling a function from, and unloading the shared library via the Matlab functions loadlibrary(), calllib() and unloadlibrary().
Calling C in Matlab, by Creating C/C++ Language MEX-Files
Calling Matlab in C, by mcc compiler
Calling C++ in Java, by JNI, and Calling Java in C++, also by JNI
Calling C/C++ in other languages, Using SWIG
Calling C in Python, by Ctypes module.
Cython
Calling R in Python, by RPy
Programming Language Bindings to OpenGL from various languages, such as Python, Fortran and Java
Bindings for a C library, such as Cairo, from various languages, such as C++, Python, Java, Common Lisp

Maybe a specific example will help. Let us take the host language to be Python and the guest language to be C. This means that Python will be calling C functions.
The first option is to write the C library in a particular way. In the case of Python, the standard way would be to have the C function written with a first parameter of PyObject *, among other conditions. For example (from here):
#include <Python.h>
#include <stdlib.h>  /* for system() */

static PyObject *
spam_system(PyObject *self, PyObject *args)
{
    const char *command;
    int sts;

    /* Unpack the Python argument tuple: one string expected. */
    if (!PyArg_ParseTuple(args, "s", &command))
        return NULL;
    sts = system(command);
    /* Wrap the C int result back up as a Python integer. */
    return Py_BuildValue("i", sts);
}
is a C function callable from Python. For this to work the library has to be written with Python compatibility in mind.
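Once that extension is built and installed as a module (the CPython tutorial builds this exact example as spam), the Python side looks like an ordinary call. A minimal sketch, assuming the module name spam from the tutorial:

import spam

# The C function runs system(command) and hands the int result back
# to Python as an ordinary int.
status = spam.system("ls -l")
print(status)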
If you want to use an already existing C library, you need another option. One is to use a tool that generates code wrapping the existing library in a format suitable for consumption by the host language. Take SWIG, which can be used to tie together many languages. Given an existing C library, you can use SWIG to effectively generate C code that calls your existing library while conforming to Python conventions. See the SWIG example for building a Python module.
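To give a feel for the workflow, here is a hedged sketch of the Python side, assuming the names from SWIG's own tutorial (an interface file example.i wrapping an existing C function fact()):

# After running `swig -python example.i` and compiling the generated
# wrapper code together with the original library, the result imports
# like any hand-written module:
import example          # module name is an assumption from SWIG's tutorial

print(example.fact(5))  # calls the pre-existing C fact() through glue code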
Another option to use an already existing C library is to call it from a Python library that effectively wraps the calls at run time, like ctypes. While compilation was necessary in option 2, it is not needed this time.
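A minimal sketch of this run-time approach, assuming a Unix-like system where ctypes can locate the C runtime:

import ctypes
import ctypes.util

# Load the C standard library at run time -- no compilation step at all.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the foreign signature so ctypes converts the arguments.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5, computed by the C strlen()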
Another thing to note is that there are a lot of options (which do overlap) for calling functions in one language from another language. There are FFIs (equivalent to language bindings as far as I know), which usually refer to calling between multiple languages in the same process (as part of the same executable, so to speak), and there are interprocess communication means (local and network). Things like CORBA, Web Services (SOAP or REST), COM+ and remote procedure calls in general are of the second category and are not seen as FFI. In fact, they mostly don't prescribe any particular language to be used at either side of the communication. I would loosely put them as IPC (interprocess communication) options, though this is a simplification in the case of network-based APIs like CORBA and SOAP.
Having a go at your list, I would venture the following opinions:
Common Object Request Broker Architecture: IPC, not FFI
Calling C in C++, by the extern "C" declaration in C++ to disable name mangling.
Calling C in Matlab, by MATLAB Interface to Shared Libraries: Option 3 (ctypes-like)
Calling C in Matlab, by Creating C/C++ Language MEX-Files: Option 2 (swig-like)
Calling Matlab in C, by mcc compiler: Option 2 (swig-like)
Calling C++ in Java, by JNI, and Calling Java in C++, also by JNI: Option 3 (ctypes-like)
Calling C/C++ in other languages, Using SWIG: Option 2 (swig)
Calling C in Python, by Ctypes: Option 3 (ctypes)
Cython: Option 2 (swig-like)
Calling R in Python, by RPy: Option 3 (ctypes-like) in part, and partly about data exchange (not FFI)
The next two are not foreign function interfaces at all, as the term is used. FFI is about the interaction between two programming languages and should be capable of making any library (with suitable restrictions) from one language available to the other. A particular library being accessible from one language does not an FFI make.
Programming Language Bindings to OpenGL from various languages
Bindings for a C library from various languages

Related

At a language level, what exactly is `ccall`?

I'm new to Julia, and I'm trying to understand, at the language level, what ccall is. At the syntax level, it looks like a normal function, but it clearly doesn't behave the same way in how it takes its arguments:
Note that the argument type tuple must be a literal tuple, and not a tuple-valued variable or expression.
Additionally, if I evaluate a variable bound to a function in the Julia REPL, I get something like
julia> max
max (generic function with 15 methods)
But if I try to do the same with ccall:
julia> ccall
ERROR: syntax: invalid "ccall" syntax
Clearly, ccall is a special piece of syntax, but it's also not a macro (no @ prefix, and invalid macro usage gives a more specific error). So, what is it? Is it something baked into the language, or something I could define myself with some language construct I'm not familiar with?
And if it is some baked-in piece of syntax, why was it decided to use function call notation, instead of implementing it as a macro or designing a more readable and distinct syntax?
In the current nightly (and thus, upcoming 0.6 release), much of the special behavior you observe has been removed (see this pull-request). ccall is no longer a reserved word, so it can be used as a function or macro name.
However there is still a slight oddity: defining a 3- or 4-argument function called ccall is allowed, but actually calling such a function will give an error about ccall argument types (other numbers of arguments are ok). The reasons go directly to your question:
So, what is it? Is it something baked into the language
Yes, ccall, though it will no longer be a keyword in 0.6, is still "baked in" to the language in several ways:
the :ccall([four args...]) expression form is recognized and specially handled during syntax lowering. This lowering step does several things including wrapping arguments in a call to unsafe_convert, which allows for customized conversion from Julia objects to C-compatible objects; as well as pulling out arguments that might need to be rooted to prevent garbage collection of a referenced object during the ccall. (see code_lowered output, or try the expand function; more info on the compiler here).
ccall requires extensive handling in the code generation backend, including: look-up of the requested function name in the specified shared library, and generation of an LLVM call instruction -- which is eventually translated to platform-specific machine code by the LLVM Just-In-Time compiler. (see the different stages with code_llvm and code_native).
And if it is some baked-in piece of syntax, why was it decided to use function call notation, instead of implementing it as a macro or designing a more readable and distinct syntax?
For the reasons detailed above, ccall requires special handling whether it looks like a macro or a function. In this mailing list thread, one of the Julia creators (Stefan Karpinski) commented on why not to make it a macro:
I suppose we could reimplement it as a macro, but that would really just be pushing the magic further down.
As far as "a more readable and distinct syntax", perhaps that is a matter of taste. It's not clear to me why some other syntax would be preferable (except for the convenience of a LuaJIT/CFFI-style inline C syntax parsing, of which I am a fan). My only strong personal wish for ccall would be to have arguments and types entered adjacent (e.g. ccall((:foo, :libbar), Void, (x::Int, y::Float))), because working with longer argument lists can be inconvenient. In 0.6 it will be possible to implement this form as a macro!
In Julia 0.5 and earlier.
It is not a function and it is not a macro.
It is indeed something special baked into the language.
It is an Intrinsic.
In Julia 0.6 this changes.
In a lot of ways it is more like a macro than a function call.
But in other ways it is not -- it does not return an AST.
It does call a function, and on a low enough level it looks similar to calling a Julia function.
The history of why it looks the way it does is beyond me; you'd need to hear from one of the people who worked on the earliest code for the language.
Right now it is everywhere, and is one of the harder things to change -- but not impossible. It would trigger up to 3 years of bikeshedding though :-P .
I like to think of ccall as being two things.
A foreign function interface, for C and other compiled languages (e.g. Fortran and, apparently, Rust work)
A way to access the raw guts of the language "runtime".
Foreign Function Interface (FFI)
Most of the time when one uses ccall in a package, one wants to invoke some code that is in a compiled library. In this sense it is C-Call, like R-Call or Py-Call.
I think mlewe/BlossomV.jl is a nice compact example.
For a more intense example, see oxinabox/SLEEF.jl.
As an FFI, the foreign code does not have to share memory space/a process with Julia -- PyCall.jl does, RCall.jl and Matlab.jl don't.
It doesn't matter as long as the result comes back.
In these cases it is theoretically possible to replace ccall with some kind of safe_ccall, which would run the called library in a separate process and would not segfault Julia if the library being called segfaulted.
But as of yet, no one has written such a method/package.
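To make the idea concrete, here is a rough sketch of that pattern in Python rather than Julia (safe_ccall itself is hypothetical, and so is my safe_strlen wrapper): the foreign call runs in a child process, so a crash in the C library kills only the child.

import ctypes
import ctypes.util
from multiprocessing import Process, Queue

def _worker(q, text):
    # The foreign call happens in the child process.
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    q.put(libc.strlen(text.encode()))

def safe_strlen(text, timeout=5.0):
    q = Queue()
    p = Process(target=_worker, args=(q, text))
    p.start()
    p.join(timeout)
    if p.is_alive():        # a hung call: kill the child, keep the parent
        p.terminate()
        raise RuntimeError("foreign call timed out")
    if p.exitcode != 0:     # a crashed call, e.g. a segfault in the library
        raise RuntimeError("foreign call died with exit code %s" % p.exitcode)
    return q.get()

if __name__ == "__main__":
    print(safe_strlen("hello"))  # 5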
Using ccall for FFI is even done in Base, like for accessing MPFR to define BigFloat.
But this is not the main reason ccall is used in Base.
Accessing the guts of the language.
ccall is really what drives a large portion of the program "doing a thing".
It is used throughout Base, to call the functions from src.
For this, ccall basically triggers a function call at the compiled level that shifts the instruction pointer directly into the compiled code of the ccalled function, like calling a function would if the whole thing had been written in, say, C.
You can see in base/threadingconstructs.jl ccall being used to manage work on threads -- that triggers code from src/threading.c.
It is used to map a section of disk to memory (mmap.jl) -- obviously this can't be done from another process.
It is used to make a section of code non-interruptible.
It is used to call LibC to do things like malloc to allocate memory (though right now this is mostly used as part of FFI).
There are tricks you can do with ccall to #undef a variable after it has already been assigned.
ccall is in many ways the "master" key to the language.
Conclusion
I've described ccall here as two things: an FFI function and a core part of the language "runtime". This division is not strict, and there is plenty of overlap, like file handling (is it FFI?).
The behaviour many expect ccall to have comes from its FFI uses.
Here ccall could just be a function.
The behaviour it actually has comes from its use as a core part of the language -- linking the Julia code of the standard library in Base to the low-level C code from src, and allowing very direct control over the running of the Julia process.

Why don't compilers translate to simpler languages?

Usually compilers translate from the language they support to assembly. Or at most to an assembly-like language (bytecode), like GIMPLE/GENERIC for GCC or Python/Java/.NET bytecode.
Wouldn't it be simpler for a compiler to translate to a simpler language that already implements a big subset of its grammar?
For example, an Objective-C compiler, which is 100% compatible with C, could add semantics only for the syntax it adds on top of C, translating it into C. I can see many advantages of doing this; one could use this Objective-C compiler to translate its code into C in order to compile the generated C code with a different compiler that doesn't support Objective-C (but that optimizes more, or compiles quicker, or is able to compile for more architectures). Or one would be able to use the generated C code in a project where only C is allowed.
I guess/hope that if things worked like this, it would be a lot easier to write extensions for current languages (e.g. adding keywords to C++ to ease the implementation of common patterns, or, still in C++, removing the declare-before-use rule by moving inline member functions to the end of header files).
What kind of penalties would there be? Would generated code be very difficult for humans to understand? Would compilers be unable to optimize as much as they can now? What else?
This is actually used by a lot of languages, through the use of intermediate languages. The biggest example of this would be Pascal, which had the Pascal-P system: Pascal was compiled into a hypothetical assembly language. To port Pascal would only mean writing a compiler for this assembly language, a task a lot simpler than porting the entire Pascal compiler. After writing this compiler, you'd only need to compile the (machine-independent) Pascal compiler that was written in Pascal itself.
Bootstrapping is also used quite often in programming language design. Many languages have their compilers written in the same language (Haskell comes to mind here). By doing this, writing new functionality for the language simply means translating that idea into the current language, putting it into the compiler, and then recompiling.
I don't think the problem with this method is really the readability of generated code (I don't sift through compiler-generated assembly, personally), but one of optimization. Many ideas in higher-level programming languages (weak typing comes to mind) are hard to automatically translate into lower-level system languages such as C. There's a reason why GCC tends to do its optimization before code generation.
But for the most part, compilers do translate into simpler languages except for maybe the most basic of system languages.
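As a toy illustration of what "translating into a simpler language" looks like (my own sketch, not how any production compiler works), here are a few lines of Python that lower a tiny arithmetic expression to C source:

import ast

# Map Python AST operator nodes to their C spellings.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def to_c(node):
    if isinstance(node, ast.BinOp):
        return "(%s %s %s)" % (to_c(node.left), OPS[type(node.op)], to_c(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(type(node).__name__)

expr = ast.parse("x * (y + 2)", mode="eval").body
print("double result = %s;" % to_c(expr))  # double result = (x * (y + 2));

Real compilers do essentially this, only over a full grammar, and with the optimization problems described above.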
Incidentally, as a counterexample, Tcl is one language that is known to be very, very hard (if not totally impossible) to translate to C. Over the last 20 years there have been a couple of projects that tried this, even one promise of a commercial product, but none have materialized.
In part it is because Tcl is a very dynamic language (as any language with an eval function is). In part it is because the only way to know if something is code or data is to run the program.
Since Objective-C is a strict superset of C and C++ contains a very large amount that is a lot like C, to parse either you effectively already need to be able to parse C. In which case, outputting to machine code and outputting to more C code aren't substantially different in processing cost, the main cost to the user being that compiling now takes as long as it originally did plus the amount of time a second compiler takes.
Any attempt to copy and paste the stuff that looks like C and translate the rest around it would be prone to problems. Firstly, C++ isn't a strict superset of C, so things that look like C don't necessarily compile exactly the same anyway (especially versus C99). And even if they did, supposing a user made an error in their C stuff, compilers don't tend to provide error information in a machine-readable format, so it'd be really hard for the Objective-C-to-C layer to give the user a meaningful error after receiving e.g. "error at line 99".
That said, many compiler suites, like GCC and even more so like the upcoming Clang + LLVM, use an intermediate form to decouple the bit that knows about the specifics of one architecture from the bit that knows the specifics of a particular language. However, it tends to be more of a data structure than something intentionally easy to express as a written language.
So: compilers don't work like this for purely practical reasons.
Haskell is actually compiled this way: the GHC compiler first translates the source code to an intermediary functional language (which is less rich than Haskell itself), performs optimizations, and then lowers the whole thing to C code which is then compiled by GCC. This solution has problems though, and projects were started to replace this backend.
http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-llvm.html
There is a compiler construction stack which is fully based on this idea. Any new language is implemented as a trivial translation into a lower-level language or a combination of languages which are already defined within this stack.
http://www.meta-alternative.net/mbase.html
However, in order to be able to do so, you'd need at least some metaprogramming capabilities in every little language you add to the hierarchy. This requirement adds some severe limitations on language semantics.

How do they write different language wrappers for the same library?

Generally a library will be released in a single language (for example C). If the library turns out to be useful, then many language wrappers for that library will be written. How exactly do they do it?
Could someone kindly throw a little light on this topic? If it is too language-dependent, pick a language of your choice and explain it.
There are a few options that come to mind:
Port the original C library to the language/platform of your choice
Compile the C library into something (like a DLL) that can be invoked from other components
Put the library on the web, expose an API over HTTP and wrap that on the client
If I wanted to wrap a C library with a managed (.NET) layer, I'd compile the library into a DLL, exposing the APIs I wanted. Then, I'd use P/Invoke to call those APIs from my C# code.
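And a rough sketch of the third option in Python, using only the standard library; the endpoint URL and the transform function are made up for illustration:

import json
import urllib.request

API_URL = "http://example.com/api/transform"  # hypothetical service

def transform(data):
    # The "wrapper" is just an HTTP round trip; the C library runs on
    # the server, behind the API, in whatever process hosts it.
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"input": data}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["output"]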

What application virtual machines are written in high level languages?

What application virtual machines are out there that are written in higher-level languages? C/C++ look like the languages of choice (for obvious reasons).
What I have found on Google is at least two written in Java (both meta-circular): JikesRVM and Maxine.
Anything else that you have found?
Many Scheme implementations are written in Scheme and although many of those are compilers or interpreters, some of those are VMs,
some Common Lisp implementations are written in Common Lisp and although many of those are compilers or interpreters, some of those are VMs,
the PyPy VM is written in RPython, which is a subset of Python with "syntax and semantics of Python, speed of C, restrictions of Java and compiler error messages as penetrable as MUMPS",
the Squeak Smalltalk VM is written in Slang (a subset of Squeak Smalltalk) and
the Klein Metacircular VM is written entirely in Self.
Of those, the most interesting are Klein and Maxine (whose design is actually based on Klein). Metacircular Lisp and Scheme implementations usually assume the existence of some basic primitive special forms, which then have to be implemented in assembler, C or a limited subset of the language in a low-level style. Squeak and PyPy use a limited subset of the language. Jikes uses "magic" methods and low-level style.
The idea of Klein and Maxine is that everything is written in high-level, object-oriented, expressive, idiomatic style. In the current version of Klein, there are only two tiny places where the style is hampered by some restriction: in the implementation of message sending, you cannot send any messages and in the implementation of object cloning you cannot clone any objects. However, the current compiler can actually inline or even completely optimize away object cloning and message sending, so those two places could be rewritten in normal OO Self style – it's just that nobody has done it yet.
All of that was just metacircular VMs. There are also other VMs written in high-level languages:
HotRuby is a Ruby VM (actually, a YARV VM) written in JavaScript,
Red Sun is a Ruby VM (actually, a YARV VM) written in ActionScript,
Rava is a JVM-like VM written in Ruby by Koichi "ko1" Sasada, the author of YARV and
Ruva is a JVM-like VM written in Ruby
Some more VM implementations are written in Tcl (Tool Command Language) and Lua, and some are written in assembler. Other variants are written in a manufacturer's own system programming language for its computer hardware.

What features of interpreted languages can a compiled one not have?

Interpreted languages are usually more high-level and therefore have features such as dynamic typing (including creating new variables dynamically without declaration), the infamous eval, and many many other features that make a programmer's life easier -- but why can't compiled languages have these as well?
I don't mean languages like Java that run on a VM, but those that compile to binary like C(++).
I'm not going to make a list now but if you are going to ask which features I mean, please look into what PHP, Python, Ruby etc. have to offer.
Which common features of interpreted languages can't/don't/do exist in compiled languages? Why?
Whether source code is compiled -- to native binaries or to some kind of intermediate language (Java bytecode/IL) -- or interpreted is not a trait of the language at all. It's just a question of the implementation.
You can actually have both compilers and interpreters for the same language, for example:
Haskell: GHC <-> GHCi
C: gcc <-> Ch
VB6: VS IDE <-> VB6 compiler
Certain language features like eval or dynamic typing may suggest a distinction between so called "dynamic languages" and static ones, but how this is run can never be the primary question.
Initially, one of the largest benefits of interpreted languages was debugging: you could get incredibly accurate and detailed information when looking for the reason a program wasn't working. However, most compilers have become advanced enough that this is not too big a deal any more.
The other main benefit (in my opinion, anyway) is that with interpreted languages you don't have to wait an eternity for your project to compile before testing it out.
You couldn't plausibly do eval, for example, for reasons I'd have thought were pretty obvious: exactly how would you implement it? Make the runtime contain a full copy of the compiler? Every time you wanted to evaluate a string (keeping in mind that each time it could be different!) you'd save the string to a file, run the compiler on it to make a DLL/shared-lib, then load that DLL/shared-lib and call your code? You can't see why this might be a wee bit impractical? ;)
Dynamic languages are full of this kind of thing, which you can't do with static code short of, in effect, running an interpreter behind the scenes.
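For contrast, here is what that looks like when the compiler does live in the runtime (Python; the string stands in for input that only exists at run time):

# The expression could come from a file, a socket, or a user prompt;
# the runtime compiles and evaluates it on the spot.
op = "+"                   # imagine this arrived at run time
expression = "3 %s 4" % op
print(eval(expression))    # 7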
Continuing on from Dario -- I think you are really asking why a compiled program can't evaluate statements at runtime (e.g. eval). Here are some reasons I can think of:
The full compiler would have to be distributed with the program (or be part of the program)
For an eval function to have access to type information and symbols (such as variable names and function names) in the environment where it is used, the original program would have to be compiled with those symbols accessible (compiled languages usually strip these symbols at compile time).
Edit: As noted, neither of these reasons makes it impossible for a language/compiler to evaluate code at runtime, but they are definitely things that need to be taken into consideration when developing a compiler or designing a language.
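The second point is easy to see in a language that keeps its symbol tables around at run time. A small Python illustration:

def area(width, height):
    # eval can see the local names because Python keeps them in
    # run-time dictionaries instead of stripping them at compile time.
    return eval("width * height")

print(area(3, 4))  # 12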
Maybe the question is not about interpreted/compiled languages ("compiled" is ambiguous anyway) but about languages that do or don't carry their own compiler around with them? For instance, we've said C++ could do eval with a handy compiler floating around in the app, and reflection presumably is similar in some ways.