What is the difference between Apache Drill's ValueVectors and Apache Arrow?

Apache Drill has its own columnar representation, much like Apache Arrow, but Apache Arrow supports more programming languages. I would like to use Apache Drill while still getting the programming-language support of Apache Arrow.
Some sources say that Apache Arrow has its roots in Apache Drill's ValueVectors.
Drill represents data internally as JSON documents – similar to
MongoDB and Elasticsearch. These JSON documents are "shredded" into
columns, which allows Drill to deliver the performance enhancements of
columnar analytics but retain the ability to query complex data. Note,
this internal representation is not based on Apache Arrow. - Source
Why can't Apache Drill make use of the Apache Arrow project? How does Drill's internal representation differ from Apache Arrow, and what advantages does Arrow have over Drill's ValueVectors, and vice versa?

The Apache Arrow Java library started out as a fork of Drill's ValueVectors when the Apache Arrow project began at the beginning of 2016. The memory representation is nearly the same; one significant difference is that Arrow uses 1 bit to represent whether a vector slot is null, while Drill uses 1 byte. We decided to change this for reasons of memory efficiency and so that popcount intrinsic operations can be used to check whether a batch of values contains any nulls.
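To make the null-representation difference concrete, here is a minimal Python sketch (not code from either project). It compares one validity byte per slot with one validity bit per slot and counts nulls by counting set bits. The helper names `pack_validity_bits` and `null_count` are made up for this illustration, and the bit counting is a plain software stand-in for a hardware popcount.

```python
def pack_validity_bits(is_valid):
    """Pack booleans into a bitmap: 1 bit per slot, least-significant bit first."""
    bitmap = bytearray((len(is_valid) + 7) // 8)
    for i, valid in enumerate(is_valid):
        if valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def null_count(bitmap, length):
    """Count null slots as (length - number of set bits) in the bitmap."""
    set_bits = bin(int.from_bytes(bitmap, "little")).count("1")
    return length - set_bits


values = [10, None, 30, 40, None, 60, 70, 80, 90, None]
is_valid = [v is not None for v in values]

byte_per_slot = bytes(1 if v else 0 for v in is_valid)  # one byte per slot
bit_per_slot = pack_validity_bits(is_valid)             # one bit per slot

print(len(byte_per_slot), len(bit_per_slot))   # sizes of the two encodings
print(null_count(bit_per_slot, len(values)))   # number of null slots
```

For ten slots the byte-per-slot encoding needs ten bytes and the bitmap needs two, and a whole batch can be checked for nulls by comparing the set-bit count against the batch length.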
It has been discussed whether to use exactly Arrow's representation in Apache Drill, but there is no timeline for this to happen. The relevant issue is https://issues.apache.org/jira/browse/DRILL-4455
Apache Arrow has been developed as an open standard with a public API in many programming languages. We now have some level of support for 11 programming languages, either through native implementations or bindings. These include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
I am not aware of any performance analysis comparing the memory representations, but the difference relating to null representation is unlikely to cause a significant difference.

Drill's community is considering moving to Apache Arrow. Please take a look at the following tickets:
https://issues.apache.org/jira/browse/ARROW-3164
https://issues.apache.org/jira/browse/DRILL-4455
But it is on hold right now, since there have been a lot of changes and improvements in both projects. As a result there are some differences in terminology, metadata notation, data types and data layout.
You can reply to this thread on the Drill dev mailing list to discuss it further:
https://lists.apache.org/thread.html/8d895fb40702f3120532f15594ea935a818ac0eb5acdb4fd1248d89f#%3Cdev.drill.apache.org%3E
Also contributions are very welcome :)

Related

Protocol Buffer vs Json - when to choose one over another

Can anyone explain when to use protocol buffer instead of JSON for micro-services architecture? And vice-versa? Both on synchronous and asynchronous communication.

When to use JSON
You need or want data to be human readable
Data from the service is directly consumed by a web browser
Your server side application is written in JavaScript
You aren’t prepared to tie the data model to a schema
You don’t have the bandwidth to add another tool to your arsenal
The operational burden of running a different kind of network service
is too great
Pros of ProtoBuf
Relatively smaller size
Guarantees type-safety
Prevents schema-violations
Gives you simple accessors
Fast serialization/deserialization
Backward compatibility
While we are at it, have you looked at flatbuffers?
Some of the aspects are covered here google protocol buffers vs json vs XML
Reference:
https://codeclimate.com/blog/choose-protocol-buffers/
https://codeburst.io/json-vs-protocol-buffers-vs-flatbuffers-a4247f8bda6f

I'd use JSON when the consumer is or could possibly be written in a language with built-in native support for JSON (JavaScript is an example), when it is a web browser, or where human readability is wanted. Speaking of which, at least for asynchronous calls, many developers enjoy the convenience of examining the contents of the queue directly for debugging and even during the normal course of development. Depending on the tech stack used, it may or may not be worth the trade-off to use protobuf just to reduce network load, since any performance increase won't buy you much in the async world. And it's not like we need to write a bunch of boilerplate code anymore, as we used to for JSON marshalling and unmarshalling in most languages.
I'd use protobuf for everything else... if there are any other use cases left for it given the considerations above. There are advantages you might see, such as performance, network load, the backwards compatibility offered by its versioning scheme, the lovely documentation that magically comes with proto files, and some validation! If for some reason you have a lot of REST or other synchronous calls between microservices, protobuf can be sent over the wire instead of JSON without many trade-offs, if any at all, while offering a heap of advantages.
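As a rough illustration of the size argument, here is a minimal sketch that uses only the Python standard library. It is not Protocol Buffers: the `struct` module stands in for a schema-based binary encoding, and the record and its field layout are invented for the example. The point is only that a text format repeats the field names in every message, while a schema agreed on up front lets you send just the values.

```python
import json
import struct

# Hypothetical record used only for this comparison.
record = {"user_id": 12345, "temperature": 21.5, "active": True}

# Text encoding: field names travel with every message.
json_bytes = json.dumps(record).encode("utf-8")

# Schema-based binary encoding: both sides agree on the layout in advance
# ("<" little-endian, I = uint32, d = float64, ? = bool), so only values are sent.
LAYOUT = struct.Struct("<Id?")
binary_bytes = LAYOUT.pack(record["user_id"], record["temperature"], record["active"])

# The receiver needs the same layout to decode the values again.
user_id, temperature, active = LAYOUT.unpack(binary_bytes)

print(len(json_bytes), len(binary_bytes))
```

The binary message is smaller, but it is unreadable without the layout, which is the same trade-off described above between human readability and network load.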

Build system that is not file-centric

We have a software infrastructure which works pretty much like a software build system: Information is gathered from different sources and used to generate some outputs. Like in traditional software builds we have different types of output, dependency trees, etc.
The main difference is that our sources, intermediate results and outputs are not inherently file-based. Rather, they're (uniquely addressable) data objects.
Right now we're mapping our data structure to files and directories in combination with a traditional build system (SCons), but that does not scale, both w.r.t. performance and (more importantly) w.r.t. maintainability. Hence I'm looking for an infrastructure that's built for this purpose from the ground up.
As an illustration, assume you have 3 XML documents A, B and C. Let's say that B/foo/bar is to be calculated from A/x/y and A/x/z, and that similarly C/a/b is calculated from A/x/y. I need an infrastructure to
Implement these relationships (i.e. the transformations and their dependencies)
Automatically re-build the relevant parts after changes are made
One major problem with using files is that, if I map A, B and C to some files A.xml, B.xml and C.xml and use a traditional build system, then any change to A.xml will trigger a rebuild of B.xml and C.xml, even if A/x/y and A/x/z (the original dependencies of B) are not modified. For fine-grained dependency resolution I would therefore need to map each of A, B and C not to a file, but to a directory where each sub-directory represents an element, files represent attributes, etc. As I said, this does not scale for us.
(Please note that our system is not actually based on XML)
Right now I'm looking for any existing software, infrastructure or concept which points into this direction, regardless of implementation language and underlying data structures.

It sounds like you need an active object database management system (ODBMS) like GemStone/S. ODBMSs provide the traditional persistence services without the old cost of mapping data structures to files, and with the well-known benefits of object technology. As you've mentioned dependency trees and addressable objects: in ODBMSs, navigational references are stored as part of the data, allowing any complex interaction patterns among objects to be represented and accessed. This is especially true when you foresee a system which makes use of inheritance, object nesting and cross-referencing.
Although an object engine may seem oversized for your requirements, it is common for large-scale production business systems to store and execute methods using ODBMSs within a concurrent, multiuser environment. It doesn't come for free, because you have to invest in the human part of the equation (education and experience), but once the initial fear is overcome, the investment will pay off.
For re-building (subscribed) parts after changes (notifications from announcers) are made, you may use the Observer design pattern, or one of its variants (SASE or the Announcements framework), to implement your announce/subscribe architecture. With this type of event framework there are intrinsic problems which are hard to solve with traditional file-based solutions, as you have noticed already. For example, it is typical for a dependency mechanism to manage the replacement of an object, or in your example an XML document, by another one. Any modern events framework should ensure that, when an object is removed in favour of another, all dependents plugged into the old object are updated to the new reference.
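The announce/subscribe idea can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not GemStone/S or any existing framework. The `DataStore` class, its method names and the path strings are all hypothetical, and it naively recomputes dependents in the order they are notified rather than topologically.

```python
from collections import defaultdict


class DataStore:
    """Hypothetical store of path-addressed values with derived entries."""

    def __init__(self):
        self.values = {}
        self.rules = {}                     # target path -> (input paths, function)
        self.dependents = defaultdict(set)  # input path -> subscribed target paths
        self.rebuilt = []                   # log of recomputed targets

    def derive(self, target, inputs, func):
        """Register a transformation and subscribe the target to its inputs."""
        self.rules[target] = (inputs, func)
        for path in inputs:
            self.dependents[path].add(target)

    def set(self, path, value):
        """Store a value and notify only the targets subscribed to this path."""
        if self.values.get(path) == value:
            return                          # unchanged: nobody is notified
        self.values[path] = value
        for target in sorted(self.dependents[path]):
            self._rebuild(target)

    def _rebuild(self, target):
        inputs, func = self.rules[target]
        if all(p in self.values for p in inputs):
            self.rebuilt.append(target)
            self.set(target, func(*(self.values[p] for p in inputs)))


store = DataStore()
store.derive("B/foo/bar", ["A/x/y", "A/x/z"], lambda y, z: y + z)
store.derive("C/a/b", ["A/x/y"], lambda y: y * 2)

store.set("A/x/y", 1)
store.set("A/x/z", 2)
store.set("A/other", 99)  # changes A, but nothing depends on this path

print(store.values["B/foo/bar"], store.values["C/a/b"], store.rebuilt)
```

Because subscriptions are per addressable element rather than per file, the change to `A/other` triggers no rebuild at all, which is the fine-grained behaviour the question asks for.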
Finally, there is a free GemStone/S stack which includes object dependency framework so you may experiment with a real object-database.

So nothing comes to mind that solves exactly your problem, but there are a few tools that might get you a little closer than you are now:
1) You might be able to throw something together using Fuse that would give you better control of how your data objects are mapped out to files. Fuse basically allows you to construct arbitrary file systems from whatever backing data you want. (The python bindings are pretty friendly, but there are a number of other language interfaces available as well). Then you could use a traditional build tool, and take advantage of file like objects better associated w/your data.
2) CMake has a pretty extensible language for writing custom targets that you might be able to press into service. Unfortunately its language is pretty didactic and has something of a steep learning curve, so it wouldn't be my first choice.

Does Apache Thrift allow foreign function calls between any two languages?

I'm currently trying to develop (an API in multiple programming languages) that can be accessed from (various other programming languages). I've taken a look at Apache Thrift, and it appears that it might be possible to allow seamless foreign function calls between any two languages using Thrift. Is this correct?

Thrift was created to facilitate communication between different processes over the network, not in-process FFI. It is probably possible to take some parts of Thrift (like the IDL) and adapt them for FFI, but it could be a nontrivial undertaking and yield suboptimal results.

I have actually been thinking of something similar myself.
There are core concepts to the Thrift specification.
The Transport: This portion is responsible for facilitating data transfer between a client and server.
The Protocol: This portion is responsible for formatting the said data in different ways. It can be JSON, compressed binary, or even raw uncompressed binary.
The Server: This is responsible for putting these things together and managing them.
Thrift allows you to mix these different parts in unique ways to create something suitable to your purpose. Thrift is still very much server-client oriented though.
Developing an API in Thrift would mean that you could theoretically have plugins in any language. The main software component would launch the sub-process and use stdin/stdout as a Transport. This would allow it to make RPC calls regardless of language.
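The sub-process idea can be sketched without Thrift at all. The following example does not use the Thrift library or its wire protocols: it swaps in newline-delimited JSON as the "Protocol" and the child's stdin/stdout pipes as the "Transport". The plugin source, the `call_plugin` helper and the message fields are all assumptions made up for this illustration. The plugin happens to be Python here, but it could be any program that reads and writes lines.

```python
import json
import subprocess
import sys

# Hypothetical plugin: reads one JSON request per line on stdin and writes one
# JSON response per line on stdout.
PLUGIN_SOURCE = r'''
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    if request["method"] == "add":
        result = sum(request["params"])
    else:
        result = None
    print(json.dumps({"id": request["id"], "result": result}), flush=True)
'''


def call_plugin(process, request_id, method, params):
    """Send one request over the child's stdin and read one reply from its stdout."""
    message = {"id": request_id, "method": method, "params": params}
    process.stdin.write(json.dumps(message) + "\n")
    process.stdin.flush()
    return json.loads(process.stdout.readline())


# The main component launches the plugin as a sub-process.
plugin = subprocess.Popen(
    [sys.executable, "-c", PLUGIN_SOURCE],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
reply = call_plugin(plugin, 1, "add", [2, 3])

# Closing stdin ends the plugin's read loop so it can exit cleanly.
plugin.stdin.close()
plugin.wait()
print(reply)
```

With Thrift, the generated client and server stubs would take the place of the hand-written JSON messages, and the IDL would define the available calls.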

Framework vs. Toolkit vs. Library [duplicate]

This question already has answers here:
What is the difference between a framework and a library? [closed]
(22 answers)
Closed 6 years ago.
What is the difference between a Framework, a Toolkit and a Library?

The most important difference, and in fact the defining difference between a library and a framework is Inversion of Control.
What does this mean? Well, it means that when you call a library, you are in control. But with a framework, the control is inverted: the framework calls you. (This is called the Hollywood Principle: Don't call Us, We'll call You.) This is pretty much the definition of a framework. If it doesn't have Inversion of Control, it's not a framework. (I'm looking at you, .NET!)
Basically, all the control flow is already in the framework, and there's just a bunch of predefined white spots that you can fill out with your code.
A library on the other hand is a collection of functionality that you can call.
I don't know if the term toolkit is really well defined. Just the word "kit" seems to suggest some kind of modularity, i.e. a set of independent libraries that you can pick and choose from. What, then, makes a toolkit different from just a bunch of independent libraries? Integration: if you just have a bunch of independent libraries, there is no guarantee that they will work well together, whereas the libraries in a toolkit have been designed to work well together – you just don't have to use all of them.
But that's really just my interpretation of the term. Unlike library and framework, which are well-defined, I don't think that there is a widely accepted definition of toolkit.

Martin Fowler discusses the difference between a library and a framework in his article on Inversion of Control:
Inversion of Control is a key part of
what makes a framework different to a
library. A library is essentially a
set of functions that you can call,
these days usually organized into
classes. Each call does some work and
returns control to the client.
A framework embodies some abstract
design, with more behavior built in.
In order to use it you need to insert
your behavior into various places in
the framework either by subclassing or
by plugging in your own classes. The
framework's code then calls your code
at these points.
To summarize: your code calls a library but a framework calls your code.
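The summary above can be shown with a toy Python example. Both the `slugify` function and the `MiniFramework` class are invented for this sketch and do not correspond to any real library or framework.

```python
# Library style: your code is in control and calls the library.
def slugify(title):
    """A tiny 'library' function: does some work and returns control."""
    return "-".join(title.lower().split())


# Framework style: the framework owns the control flow and calls your code.
class MiniFramework:
    """A toy 'framework' with one kind of extension point (event handlers)."""

    def __init__(self):
        self.handlers = {}

    def on(self, event_name):
        """Decorator that plugs user code into the framework."""
        def register(func):
            self.handlers[event_name] = func
            return func
        return register

    def run(self, events):
        # The loop lives in the framework; user code only fills in the gaps.
        return [self.handlers[name](payload) for name, payload in events]


app = MiniFramework()


@app.on("greet")
def greet(name):
    return "hello " + name


print(slugify("Framework vs Library"))  # we call the library
print(app.run([("greet", "world")]))    # the framework calls us
```

Note that `greet` is never called directly by the application code: it is only registered, and `run` decides when to invoke it.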

Diagram
If you are a more visual learner, here is a diagram that makes it clearer:
(Credits: http://tom.lokhorst.eu/2010/09/why-libraries-are-better-than-frameworks)

The answer provided by Barrass is probably the most complete. However, the explanation could easily be stated more clearly. Most people miss the fact that these are all nested concepts. So let me lay it out for you.
When writing code:
eventually you discover sections of code that you're repeating in your program, so you refactor those into Functions/Methods.
eventually, after having written a few programs, you find yourself copying functions you already made into new programs. To save yourself time you bundle those functions into Libraries.
eventually you find yourself creating the same kind of user interfaces every time you make use of certain libraries. So you refactor your work and create a Toolkit that allows you to create your UIs more easily from generic method calls.
eventually, you've written so many apps that use the same toolkits and libraries that you create a Framework that has a generic version of this boilerplate code already provided so all you need to do is design the look of the UI and handle the events that result from user interaction.
Generally speaking, this completely explains the differences between the terms.

Introduction
There are various terms relating to collections of related code, which have both historical (pre-1994/5 for the purposes of this answer) and current implications, and the reader should be aware of both, particularly when reading classic texts on computing/programming from the historic era.
Library
Both historically, and currently, a library is a collection of code relating to a specific task, or set of closely related tasks which operate at roughly the same level of abstraction. It generally lacks any purpose or intent of its own, and is intended to be used by (consumed) and integrated with client code to assist client code in executing its tasks.
Toolkit
Historically, a toolkit is a more focused library, with a defined and specific purpose. Currently, this term has fallen out of favour, and is used almost exclusively (to this author's knowledge) for graphical widgets and GUI components. A toolkit will most often operate at a higher layer of abstraction than a library, and will often consume and use libraries itself. Unlike libraries, toolkit code will often be used to execute the task of the client code, such as building a window, resizing a window, etc. The lower levels of abstraction within a toolkit are either fixed, or can themselves be operated on by client code in a prescribed manner. (Think Window style, which can either be fixed, or which could be altered in advance by client code.)
Framework
Historically, a framework was a suite of inter-related libraries and modules which were separated into either 'General' or 'Specific' categories. General frameworks were intended to offer a comprehensive and integrated platform for building applications by offering general functionality, such as cross platform memory management, multi-threading abstractions, dynamic structures (and generic structures in general). Historical general frameworks (without dependency injection, see below) have almost universally been superseded by polymorphic templated (parameterised) packaged language offerings in OO languages, such as the STL for C++, or in packaged libraries for non-OO languages (guaranteed Solaris C headers). General frameworks operated at differing layers of abstraction, but universally low level, and like libraries relied on the client code carrying out its specific tasks with their assistance.
'Specific' frameworks were historically developed for single (but often sprawling) tasks, such as "Command and Control" systems for industrial systems, and early networking stacks, and operated at a high level of abstraction and, like toolkits, were used to carry out execution of the client code's tasks.
Currently, the definition of a framework has become more focused and taken on the "Inversion of Control" principle as mentioned elsewhere as a guiding principle, so program flow, as well as execution is carried out by the framework. Frameworks are still however targeted either towards a specific output; an application for a specific OS for example (MFC for MS Windows for example), or for more general purpose work (Spring framework for example).
SDK: "Software Development Kit"
An SDK is a collection of tools to assist the programmer to create and deploy code/content which is very specifically targeted to either run on a very particular platform or in a very particular manner. An SDK can consist of simply a set of libraries which must be used in a specific way only by the client code and which can be compiled as normal, up to a set of binary tools which create or adapt binary assets to produce its (the SDK's) output.
Engine
An Engine (In code collection terms) is a binary which will run bespoke content or process input data in some way. Game and Graphics engines are perhaps the most prevalent users of this term, and are almost universally used with an SDK to target the engine itself, such as the UDK (Unreal Development Kit) but other engines also exist, such as Search engines and RDBMS engines.
An engine will often, but not always, allow only a few of its internals to be accessible to its clients, most often to either target a different architecture, change the presentation of the output of the engine, or for tuning purposes. Open Source Engines are by definition open to clients to change and alter as required, and some proprietary engines are fixed completely. The most often used engines in the world however, are almost certainly JavaScript Engines. Embedded into every browser everywhere, there are a whole host of JavaScript engines which will take JavaScript as an input, process it, and then output to render.
API: "Application Programming Interface"
The final term I am answering is a personal bugbear of mine: API was historically used to describe the external interface of an application or environment which itself was capable of running independently, or at least of carrying out its tasks without any necessary client intervention after initial execution. Applications such as Databases, Word Processors and Windows systems would expose a fixed set of internal hooks or objects to the external interface which a client could then call/modify/use, etc. to carry out capabilities which the original application could carry out. APIs varied in how much functionality was available through the API, and also in how much of the core application was (re)used by the client code. (For example, a word processing API may require the full application to be background loaded when each instance of the client code runs, or perhaps just one of its linked libraries; whereas a running windowing system would create internal objects to be managed by itself and pass back handles to the client code to be utilised instead.)
Currently, the term API has a much broader range, and is often used to describe almost every other term within this answer. Indeed, the most common definition applied to this term is that an API offers up a contracted external interface to another piece of software (Client code to the API). In practice this means that an API is language dependent, and has a concrete implementation which is provided by one of the above code collections, such as a library, toolkit, or framework.
To look at a specific area, protocols, for example, an API is different to a protocol which is a more generic term representing a set of rules, however an individual implementation of a specific protocol/protocol suite that exposes an external interface to other software would most often be called an API.
Remark
As noted above, historic and current definitions of the above terms have shifted, and this can be seen to be down to advances in scientific understanding of the underlying computing principles and paradigms, and also down to the emergence of particular patterns of software. In particular, the GUI and Windowing systems of the early nineties helped to define many of these terms, but since the effective hybridisation of OS Kernel and Windowing system for mass consumer operating systems (bar perhaps Linux), and the mass adoption of dependency injection/inversion of control as a mechanism to consume libraries and frameworks, these terms have had to change their respective meanings.
P.S. (A year later)
After thinking carefully about this subject for over a year I reject the IoC principle as the defining difference between a framework and a library. There ARE a large number of popular authors who say that it is, but there are an almost equal number of people who say that it isn't. There are simply too many 'Frameworks' out there which DO NOT use IoC to say that it is the defining principle. A search for embedded or micro controller frameworks reveals a whole plethora which do NOT use IoC and I now believe that the .NET language and CLR is an acceptable descendant of the "general" framework. To say that IoC is the defining characteristic is simply too rigid for me to accept I'm afraid, and rejects out of hand anything putting itself forward as a framework which matches the historical representation as mentioned above.
For details of non-IoC frameworks, see, as mentioned above, many embedded and micro frameworks, as well as any historical framework in a language that does not provide callback through the language (OK. Callbacks can be hacked for any device with a modern register system, but not by the average programmer), and obviously, the .NET framework.

A library is simply a collection of methods/functions wrapped up into a package that can be imported into a code project and re-used.
A framework is a robust library or collection of libraries that provides a "foundation" for your code. A framework follows the Inversion of Control pattern. For example, the .NET framework is a large collection of cohesive libraries in which you build your application on top of. You can argue there isn't a big difference between a framework and a library, but when people say "framework" it typically implies a larger, more robust suite of libraries which will play an integral part of an application.
I think of a toolkit the same way I think of an SDK. It comes with documentation, examples, libraries, wrappers, etc. Again, you can say this is the same as a framework and you would probably be right to do so.
They can almost all be used interchangeably.

Very, very similar. A framework is usually a bit more developed and complete than a library, and a toolkit can simply be a collection of similar libraries and frameworks.
A really good question that is maybe even the slightest bit subjective in nature, but I believe that is about the best answer I could give.

Library
I think it's unanimous that a library is code already coded that you can use so as not to have to code it again. The code must be organized in a way that allows you to look up the functionality you want and use it from your own code.
Most programming languages come with standard libraries, especially some code that implements some kind of collection. This is always for the convenience that you don't have to code these things yourself. Similarly, most programming languages have construct to allow you to look up functionality from libraries, with things like dynamic linking, namespaces, etc.
So code that finds itself often needed to be re-used is great code to be put inside a library.
Toolkit
A set of tools used for a particular purpose. This is unanimous. The question is, what is considered a tool and what isn't. I'd say there's no fixed definition, it depends on the context of the thing calling itself a toolkit. Example of tools could be libraries, widgets, scripts, programs, editors, documentation, servers, debuggers, etc.
Another thing to note is the "particular purpose". This is always true, but the scope of the purpose can easily change based on who made the toolkit. So it can easily be a programmer's toolkit, or it can be a string parsing toolkit. One is so broad, it could have tool touching everything programming related, while the other is more precise.
SDKs are generally toolkits, in that they try and bundle a set of tools (often of multiple kind) into a single package.
I think the common thread is that a tool does something for you, either completely, or it helps you do it. And a toolkit is simply a set of tools which all perform or help you perform a particular set of activities.
Framework
Frameworks aren't quite as unanimously defined. It seems to be a bit of a blanket term for anything that can frame your code. Which would mean: any structure that underlies or supports your code.
This implies that you build your code against a framework, whereas you build a library against your code.
But it seems that sometimes the word framework is used in the same sense as toolkit or even library. The .Net Framework is mostly a toolkit, because it's composed of the FCL, which is a library, and the CLR, which is a virtual machine. So you would consider it a toolkit for C# development on Windows, with Mono being a toolkit for C# development on Linux. Yet they called it a framework. It makes sense to think of it this way too, since it kind of frames your code, but a frame should support and hold things together rather than do any kind of work, so my opinion is that this is not the way you should use the word.
And I think the industry is trying to move into having framework mean an already written program with missing pieces that you must provide or customize. Which I think is a good thing, since toolkit and library are great precise terms for other usages of "framework".

Framework: installed on your machine and allowing you to interact with it. Without the framework you can't send programming commands to your machine.
Library: aims to solve a certain problem (or several problems related to the same category)
Toolkit: a collection of many pieces of code that can solve multiple problems on multiple issues (just like a toolbox)

It's a little bit subjective I think. The toolkit is the easiest. It's just a bunch of methods and classes that can be used.
For library vs. framework, I tell them apart by how they are used. I read the perfect answer somewhere a long time ago: the framework calls your code, whereas your code calls the library.

In relation with the correct answer from Mittag:
A simple example: let's say you implement the ISerializable interface (.Net) in one of your classes. You then make use of the framework qualities of .Net, rather than its library qualities. You fill in the "white spots" (as Mittag said) and you have the skeleton completed. You must know in advance how the framework is going to "react" with your code. Actually .Net IS a framework, and here is where I disagree with the view of Mittag.
The full, complete answer to your question is given very lucidly in Chapter 19 (the whole chapter devoted to just this theme) of this book, which is a very good book by the way (not at all "just for Smalltalk").

Others have noted that .net may be a framework, a library and a toolkit depending on which part you use, but perhaps an example helps. Entity Framework, for dealing with databases, is a part of .net that does use the inversion of control pattern. You let it know your models and it figures out what to do with them. As a programmer, it requires you to understand "the mind of the framework", or more realistically the mind of the designer and what they are going to do with your inputs. DataReader and related calls, on the other hand, are simply a tool to get data from or put data into a table/view and make it available to you. It would never understand how to take a parent-child relationship and translate it from object to relational; you'd use multiple tools to do that. But you would have much more control over how that data was stored, when, transactions, etc.

Cross-platform and language (de)serialization

I'm looking for a way to serialize a bunch of C++ structs in the most convenient way so that the serialization is portable across C++ and Java (at a minimum) and across 32bit/64bit, big/little endian platforms. The structures to be serialized just contain data, i.e. they're pure data objects with no state or behavior.
The idea being that we serialize the structs into an octet blob that we can store in a database "generically" and be read out later on. Thus avoiding changing the database whenever a struct changes and also avoiding assigning each data member to a field - i.e. we only want one table to hold everything "generically" as a binary blob. This should make less work for developers and require less changes when structures change.
I've looked at boost.serialize but don't think there's a way to enable compatibility with Java. And likewise for inheriting Serializable in Java.
If there is a way to do it by starting with an IDL file that would be best as we already have IDL files that describe the structures.
Cheers in advance!

I stumbled here, having a very similar question. 6 years later, this might not be useful to you, but hopefully it will be to others.
There are a lot of alternatives, unfortunately with no clear winner (although one could argue that JSON is the clear winner). Even Google has released multiple competing technologies (all of them apparently being used internally):
FlatBuffers: this one seems to meet the requirements from the original question, has interesting benchmarks and supports some form of IDL (I'm personally not familiar with IDL)
Protocol Buffers: mentioned previously.
XFJSON: 5%-12% smaller than JSON.
Not to forget the alternatives posted in the other answers. Here are a few more:
YAML: JSON minus all the double quotes, but using indentation instead. It's more human readable, but probably less efficient, especially as it gets larger.
BSON (Binary JSON)
MessagePack (Another compacted JSON)
With so many variations, JSON is clearly the winner in terms of simplicity/convenience and cross-platform access. It has gained even more popularity in the last couple years, with the rise of JavaScript. A lot of people probably use that as a de-facto solution, without giving it much thought (that's what I originally did :P).
However, if size becomes an issue, but you prefer to keep things simple and not use one of the more advanced libraries, you could just compress JSON using zlib (that's what I'm doing now), or some other cross-platform algorithm (but that's a whole other topic).
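A minimal sketch of the compress-JSON-with-zlib approach, using only the Python standard library. The sample records are invented for illustration, and the actual ratio depends heavily on how repetitive your data is.

```python
import json
import zlib

# Hypothetical sample data; repetitive keys are what make JSON compress well.
records = [{"id": i, "name": f"sensor-{i}", "reading": i * 0.5} for i in range(200)]

# Serialize to JSON text, then compress the UTF-8 bytes.
raw = json.dumps(records).encode("utf-8")
compressed = zlib.compress(raw, level=9)

# The reader reverses the two steps: decompress, then parse.
restored = json.loads(zlib.decompress(compressed).decode("utf-8"))

print(len(raw), len(compressed))
```

Since zlib implementations exist for both C++ and Java, a blob written this way stays readable from either side, at the cost of no longer being human readable at rest.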
To speed up JSON handling in C++, you could also use RapidJSON.

I'm surprised Jon Skeet hasn't already pounced on this one :-)
Protocol Buffers is pretty much designed for this sort of scenario -- passing structured data cross-language.
That said, if you're using a database the way you suggest, you really shouldn't be using a full-strength RDBMS like Oracle or SQL Server but rather a lightweight key-value store such as Berkeley DB or one of the many "cloud table" engines.

If I want to go really, really cross-language, I would normally suggest JSON, because of the ease of JavaScript support and the abundance of libraries, and because it is human readable and modifiable (I prefer it to XML as I find it smaller in terms of characters, faster, and more readable). It's not the most efficient in terms of space, however, and a more machine-readable format like Protocol Buffers or Thrift would have advantages there (Thrift code can be generated from an IDL, but it is also made for encoding services, so it could be heavier than you want).

You need ASN.1! (Some people refer to this as binary XML.) ASN.1 is very compact and thus ideal to transfer data between two systems. And for those who don't think this is ever used: several Internet protocols are based upon the ASN.1 model for data serialization!
Unfortunately, there aren't many libraries available for Java or C++ that will support ASN.1. I had to work with it several years ago and just couldn't find a good, free or inexpensive tool to allow support for ASN.1 in C++. At Objective Systems they are selling ASN.1/XML solutions but it's extremely expensive. (The ASN.1 compiler for C++ and Java, that is!) It costs you an arm and a leg at least! (But then you will have a tool that you can use with only one hand...)

I'd suggest saving the data with SQLite database. The structs can be stored as database rows in SQLite tables.
The resulting database file is binary compatible across many different platforms and can be stored as a BLOB in your main database. I believe the file size is comparable to compressed XML file with the same data, but memory usage during processing will be significantly less than XML DOM.

Why haven't you chosen XML? It suits your demand perfectly, and both C++ and Java allow for an easy implementation.
Furthermore, I have doubts about your idea of storing everything as a blob in the database. Use a relational database for what it was designed for, or switch to an object-oriented database like http://www.versant.com/en_US/products/objectdatabase, which supports both Java and C++.

There is also Avro. Look at this question for a comparison of Apache Thrift, Protocol Buffers, mes and so on.
