Please help to suggest some merits and demerits of Flatbuffers and CBOR protocols. Both these binary formats claim to be good on their websites, but I am not able to make some good differences between the two.
Flatbuffers:
Advantage:
Strict typing in FlatBuffer, Cap’n proto and other similar solutions is seen as major key point for performance since no additional encoding/decoding is necessary.
The data model allows simple offsetting of typed objects with a compact data structure and fast access
FlatBuffers does not need a parsing/ unpacking step to a secondary representation before you can access data often coupled with per-object memory allocation.
Disadvantage:
New and not standardized like CBOR.
CBOR
Advantage:
Can create and process entirely in stream with no extra memory
Don’t have to pre-define any schema as our data is dynamic and variant
It’s an open international standard from the IETF makes it a even better choice than a proprietary one.
It’s designed for low memory, non-conversion, stream-based processing while also providing extensions for other data types
Disadvantage:
CBOR says that it follows the JSON model (so not strictly typed objects)
It starts with the same types of objects (strings, integers, maps, etc.).
PS:
It feels like managing types in CBOR will be performance costly compared to flatbuffers, but as CBOR is standardized protocol I am inclined to prefer it if this difference is not huge. Please let me know which of two will you all recommend and why.
I think you've already spelled it out quite clearly yourself. FlatBuffer's strength is being able to access the data without parsing/unpacking/allocation, which can give serious performance benefits in some scenarios. But if this doesn't matter to you, e.g. Protocol Buffers may work just as well.
Strong typing vs dynamic typing in data matters a lot too. I'd only use the latter if I wanted generic data storage with no constraints ahead of time.
Btw, if for some reason you prefer dynamic typing, but would also like to have the performance benefits of in-place access, there is actually a format that combines the two: https://google.github.io/flatbuffers/flexbuffers.html
FlatBuffers is not "proprietary". It may have been designed at Google, but it is open source and relied upon by many other companies.
I chose CBOR for my site https://kwippe.com - we use it to store all of the artwork and keyword data as compressed strings within a very small JSON structure, only a few attributes necessary to categorize the file. So the files are very small, and load very fast. I used this for over 30,000 SVG files, which I converted to JSON beforehand. All of the JSON is converted to string and compressed via a string compression library, then saved as part of the smaller JSON object that I encode to CBOR.
I've had very few problems with this CBOR system, and it was far easier to set up than FlatBuffers and some of the other binary solutions that I looked at.
I had this same question and went with CBOR for a couple reasons.
You have a CON that CBOR like JSON doesn't have strict types, true, you'll need to do a little validation to make sure the type you got is one you expected. You're right, this is what a Schema serializer gets you. You lose flexibility of changing types, but you know what you're going to get. I work on embedded in C, and static typing is important.
What you didn't list as a PRO is that CBOR 'can' retain JSON compatibility. That any valid JSON is valid CBOR, but not the other way around. A cbor can have a map item (object, key/value pair) of 1 : 2 that's integer 1 name has the value of integer 2. This isn't great a practice but there could be some uses for it. If you avoid the intentionally incompatible things, CBOR >> JSON conversion can be very handy. When would you use that? Well, I use it for logs. When my CBOR packets hit our server, they are converted to JSON and stored away already human readable for analytics. You can do this with any serializer, but we felt there was a lot less chance for "interpretation" differences in the close conversion.
The main factor for us was the schema was too difficult to share, and synchronize. If you own both sides of an A to B system, a schema is great! You get size efficiency because the map "Apples" : 100 is just stored as [1,100] but you had to get your schema file on both sides and compiled in (if using code generation) before you could get any work done. Ok, but what if you have 10 sides in a star pattern A B C D E F G H I J, where A and J can get messages to each other, B and H almost exclusively chat except for a message that goes to E and never back from, etc... In this scenario a schema can be very difficult! Maybe it's working and you add a whole slew of messages the option is to have old schemas, optional or missing definitions, or you synchronize everyone. For us this was the case and it would have taken place over 4 languages and in systems we didn't own.
Instead, we chose schemaless CBOR and appropriately name each map item. "apples" is for A,B,C, and J. "bananas" is an item that will go to C, H and E but never F, etc. Each side needs to know what it should expect and that's all.
As I understand it, FlatBuffers does have a schema-less mode, but I know little about it. I don't think there is a right answer, but for what it's worth, our web developers took to and understood CBOR right away because it's so similar in look and feel to JSON.
UPDATE: If interested in CBOR, but could really use some schema support and/or a clear way to document what the expected data is. CDDL (RFC 8610) looks to do exactly this. Also supports data definition of JSON because of how similar CBOR and JSON can be. There are also CDDL code generation tools for various languages that will accept the CDDL file, and help generate code for deserializing, parsing, validating the CBOR/JSON data. For me, this was the largest pain point of not having a schema, I was left to do this work and make mistakes on my own.
Related
This is NOT a question about XML! This is a question about transforming binary data in a Google Protocol Buffer.
Let's say I have two .proto's generating two different "Messages". Imagine that in the one message all the units are metric, in the other they are all English. Aside from that names are all capitalized in the one and not the other.. and so on, and so on.
Now my question is:
How can I generically transform protocol buffer data in place WITHOUT either: (1) writing custom implementation to access a field in object A only to process it and mutate it into object B, or (2) pulling the data out of the proto namespace and paradigm (eg: stream to xml).
So far my solution has been moving data from protocol buffers through Xerces, transforming in Xalan and then streaming back into another object. Painful, clunky, slow.
Quite simply: there isn't anything comparable pre-existing of which I am aware. In theory something could be possible using the reader/writer APIs (for whichever platform you're targeting), but it still wouldn't be trivial, especially in the treatment of sub-objects.
It could be interesting to investigate such a transformation API, but I don't imagine it is going to be common-place enough to warrant anything as advanced as xslt.
I'm new to Immutable.js, so this is a very trivial question.
It looks like I can't get a Map value like with plain old Javascript objects, e.g. myMap.myKey. Apparently I have to write myMap.get('myKey')
I am very surprised by this behavior. Is there a reason for that? Is there any extension to Immutable.js which would allow me to type myMap.myKey?
Came back to elaborate on my comment, but SO doesn't allow that after certain time. Converting it into an answer.
The question you have asked has been reciprocated several times with people who start new with immutable, yours truly included. Its on one of the rants I wrote a while ago.
It starts to make sense when you look at it from immutability perspective. If you expose value types as your own properties, they won't be immutable because they are value types and could be assigned to.
Nonetheless, its frustrating to spread these getters all across your components/views. If you can afford it, you should try to use the Record type. It offers traditional access to members (except in IE 8). Better still, you can extend from this type and add helper getters/setters (e.g. user.getName(), user.setName('thebat') instead of user.get('name')/set('name', 'thebat')) to abstract your model's internal structure from your views. However there are challenges to overcome like nested structures and de-serialization of objects.
If the above is not your cup of tea, I'd recommend swallowing the bitter pill :).
I think you are missing the concept Immutable was build:
Immutable data cannot be changed once created, leading to much simpler
application development, no defensive copying, and enabling advanced
memoization and change detection techniques with simple logic.
Persistent data presents a mutative API which does not update the data
in-place, but instead always yields new updated data.
One way or another you may transform Immutable data structures to plain old JS objects as: myMap.toJS()
I'm thinking of introducing a strongly typed (read - with predefined schema) data interchange format for communication between our internal services. For example, I guess something like Thrift or Cap'n Proto.
At least two obvious advantages (to me) of using this over something like JSON is that
you would KNOW the exact format of the data the service can expect (so leaves less room for ambiguity and errors while communicating) and
the implementation generally deserializes the raw message for you and it provides methods for accessing the objects.
What are the practical disadvantages for going this route, versus something like JSON?
For context - our system consists of services written in python and java - and possibly other languages in the future, and communicates via HTTP endpoints between services and message brokers like rabbitmq.
As with every strongly typed system, one of the major advantanges is without a doubt that if you make mistakes, it fails early in the process, typically at the compilation stage, which is a good thing.
Second biggest advantage IMHO is what you already said: because the fields and types are well known, the compiler, libraries and related code know what data to expect and thus can be written/organized in a more efficient manner - or in short: performance.
In contrast, a losely typed system (like Avro), while allowing for much greater flexibility without the need of recompiling, comes with the other side of the same coin: the downside of being prone to errors regarding the contents of the message at runtime.
This is because a losely defined system defines only the syntax of a valid document (like for example XML) and leaves the message-level semantics of what's in the document up to the upper layers. A strongly typed system has the knowledge about those message-level semantics already built in at compile time. Therefore, it is easy to detect/decide whether a particular document or message is not only well-formed but valid with regard to the message contents. If you need to do the same with the losely defined system, you need to provide additional information at runtime (like XML schema) and validate your document against it.
Bottom line
What system you prefer is more or less a matter of taste in most cases. I'd make the decision based on the question, how variable the data are that I have to deal with. If it makes sense to use a strongly typed system, I'd go that way, because I like it very much to get informed about errors and mistakes early.
However, if there is a need for very flexible data structures, it may make more sense to go the other road. Although designing a losely typed schema on top of a strongly typed system is surely possible, it is somewhat contradicting and you'll end up with some overly complicated, while overly generic, thing.
Typed
Incoming messages that are type tagged is very liberating, so long as it's possible to tell what the incoming message is without reading all of it. If so then you no longer care so much about message order. This is because it's easy for the recipient of the messages to handle whatever it is sent. So you can have an application which just sits there taking whatever it gets, and just does whatever is appropriate for each one.
Format
A schema language that allows you to define value and size constraints is very useful. It means that the sender of a message cannot accidentally send an invalid one. Moreover the receiver can automatically tell if an incoming message meets the schema. This is a real bonus in implementing a network service; the vast bulk of the message validation is done for you!
By size constraint, I mean that you can specify how long an array is in the schema and the generated code will refuse to handle arrays longer or shorter. By value constraints, imagine a message field called "bearing"; you might want to constrain that to be between 0 and 359.
These both allow you to make a clear, unambiguous statement about what the interface is and have it enforced automatically. How many security bugs have there been recently where some network interface data validation has been badly implemented...
Options
One serialisation standard that does all this is ASN.1. The tools I've used take an ASN.1 schema and produce code to serialise and deserialise, automatically checking that the value and size constraints have been met and also telling you what an incoming message type is. The tools for ASN.1 can be quite elderly and are in need of updating. If updated it would be ideal for every purpose, with both binary and text wire formats available.
There's now JSON schemas too, and they seem to have type, value and size constraints. This might be what you're looking for.
I'm fairly sure that Google Protocol Buffers doesn't do type tagging very well, and doesn't do value and size constraints. I've seen comments in GPB schema along the lines of:
// musn't be greater than 10.
If that's what is being written into a schema, the schema language is arguably inadequate...
I'm not sure of Thrift, I'm not sure it does value constraints (someone correct me if I'm wrong please!).
Disadvantages
Can't think of any! It can irritate developers; code they thought was good can be readily revealed to be producing junk messages, which annoys them intensely...
The problem:
You have some data and your program needs specified input. For example strings which are numbers. You are searching for a way to transform the original data in a format you need.
And the problem is: The source can be anything. It can be XML, property lists, binary which
contains the needed data deeply embedded in binary junk. And your output format may vary
also: It can be number strings, float, doubles....
You don't want to program. You want routines which gives you commands capable to transform the data in a form you wish. Surely it contains regular expressions, but it is very good designed and it offers capabilities which are sometimes much more easier and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given by other sources. If it can't, they are doomed or use programs like business
intelligence. That is NOT the problem.
I am talking of a tool for a developer who knows what is he doing, but who is also dissatisfied to write every time routines in a regular language. A professional data manipulation tool, something like a hex editor, regex, vi, grep, parser melted together
accessible by routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
write macros which allows to execute a command chain repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read xml strings embedded in a binary file with variable length records. Your
tool reads the record length and deletes the junk surrounding your text. Now it splits open
the xml and extracts the strings. Being Indian number glyphs and containing decimal commas instead of decimal points, your tool transforms it into ASCII and replaces commas with points. Now the results must be stored into matrices of variable length....etc. etc.
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like or even, if it does not fulfill the conditions, wouldn't you want to miss ?
EDIT: The question is if a solution for the problem exists and if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort is available. You DO NOT invent your own text parsing
method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions or at least papers describing the problem and giving suggestions. And there are people who may have worked and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and I should work out and implement it myself without background
knowledge seems for me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve in the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime and so far I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who will give it a try to implement a good parsing language, the smallest set of operators to directly transform any input data to any output data if (!) they were powerful enough seems to be:
Insert/Remove: Self-explaining
Group/Ungroup: Split the input data into a set of tokens and organize them into groups
and supergroups (datastructures, lists, tables etc.)
Transform
Substituition: Change the content of the tokens (special operation: replace)
Transposition: Change the order of tokens (swap,merge etc.)
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to quickly get an overview of what SnapLogic can do, and if it would generally meet their needs, without investing much time and learn the system in earnest.
Also, I realize that the scope and typical uses of of SnapLogic differ, somewhat, from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonnable size])
merge sources
Filter records within a source (to select and, at a later step, work on say only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of functionality of SnapLogic (for the described purpose of the OP) is with regards to parsing. Standard components will only read generic file formats (XML, RSS, CSV, Fixed Len, DBMSes...) therefore structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such are unlikely to ever be a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways, with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parse the file at "record" level (or line level or block level whatever this may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different files format, whereby each new format requires piecing together components with no or little coding. Whatever the approach used for the input / parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashion.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the gap in feature concerning the "codeless" parsing of odd-formatted files, but also saw some commonality of features with regards to creating various processing pipelines.
I also edged my suggestion, with an expression like "inspire onself from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per-se, nor a critique of business-oriented use of open-source, but rather a warning that business/commercial pressures may shape the offering in various direction)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python which is a very swell language for coding various connectors, convertion logic etc.
In reply to Paul Nathan you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be so. After all, all of our code will be thrown away and replaced eventually, no matter how perfect we wrote it. So my opinion is that writing throwaway code is pretty much ok, if you don't spend too much time writing it.
So, it seems that there are two approaches to solving your solution: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and storing it in some specific structure) or b) use some general purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I might as well be wrong, so please if you find a perfect tool, drop here a link (I myself am doing lots of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with bunch of useful libraries (regexps, XML manipulation, creating parsers...) it shouldn't be too hard, and may be gradually turned into a DSL for the very purpose. Beside Perl which was already mentioned, Python and Ruby sound like good candidates for these languages (I bet some Lisp derivatives too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.
I'm looking for a way to serialize a bunch of C++ structs in the most convenient way so that the serialization is portable across C++ and Java (at a minimum) and across 32bit/64bit, big/little endian platforms. The structures to be serialized just contain data, i.e. they're pure data objects with no state or behavior.
The idea being that we serialize the structs into an octet blob that we can store in a database "generically" and be read out later on. Thus avoiding changing the database whenever a struct changes and also avoiding assigning each data member to a field - i.e. we only want one table to hold everything "generically" as a binary blob. This should make less work for developers and require less changes when structures change.
I've looked at boost.serialize but don't think there's a way to enable compatibility with Java. And likewise for inheriting Serializable in Java.
If there is a way to do it by starting with an IDL file that would be best as we already have IDL files that describe the structures.
Cheers in advance!
I stumbled here, having a very similar question. 6 years later, this might not be useful to you, but hopefully it will be to others.
There are a lot of alternatives, unfortunately with no clear winner (although one could argue that JSON is the clear winner). Even Google has released multiple competing technologies (all of them apparently being used internally):
FlatBuffers: this one seems to meet the requirements from the original question, has interesting benchmarks and supports some form of IDL (I'm personally not familiar with IDL)
Protocol Buffers: mentioned previously.
XFJSON: 5%-12% smaller than JSON.
Not to forget the alternatives posted in the other answers. Here are a few more:
YAML: JSON minus all the double quotes, but using indentation instead. It's more human readable, but probably less efficient, especially as it gets larger.
BSON (Binary JSON)
MessagePack (Another compacted JSON)
With so many variations, JSON is clearly the winner in terms of simplicity/convenience and cross-platform access. It has gained even more popularity in the last couple years, with the rise of JavaScript. A lot of people probably use that as a de-facto solution, without giving it much thought (that's what I originally did :P).
However, if size becomes an issue, but you prefer to keep things simple and not use one of the more advanced libraries, you could just compress JSON using zlib (that's what I'm doing now), or some other cross-platform algorithm (but that's a whole other topic).
To speed up JSON handling in C++, you could also use RapidJSON.
I'm surprised Jon Skeet hasn't already pounced on this one :-)
Protocol Buffers is pretty much designed for this sort of scenario -- passing structured data cross-language.
That said, if you're using a database the way you suggest, you really shouldn't be using a full-strength RDBMS like Oracle or SQL Server but rather a lightweight key-value store such as Berkeley DB or one of the many "cloud table" engines.
If I want to go really really cross language, I normally would suggest JSON, as the ease of javascript support and an abundance of libraries, as well as being human readable and modifiable (I prefer it to XML as I find it smaller in terms of chars, faster, and more readable). It's not the most efficient in terms of space, however, and a more machine readable format like protocol buffers or thrift would have advantages there (thrift can be made from an IDL, but it is also made for encoding services, so it could be heavier than you want).
You need ASN.1! (Some people refer to this as binary XML.) ASN.1 is very compact and thus ideal to transfer data between two systems. And for those who don't think this is ever used: several Internet protocols are based upon the ASN.1 model for data serialization!
Unfortunately, there aren't many libraries available for Java or C++ that will support ASN.1. I had to work with it several years ago and just couldn't find a good, free or inexpensive tool to allow support for ASN.1 in C++. At Objective Systems they are selling ASN.1/XML solutions but it's extremely expensive. (The ASN.1 compiler for C++ and Java, that is!) It costs you an arm and a leg at least! (But then you will have a tool that you can use with only one hand...)
I'd suggest saving the data with SQLite database. The structs can be stored as database rows in SQLite tables.
The resulting database file is binary compatible across many different platforms and can be stored as a BLOB in your main database. I believe the file size is comparable to compressed XML file with the same data, but memory usage during processing will be significantly less than XML DOM.
Why haven't you chosen XML, as this perfectly suits your demand. Both C++ and Java allow for an easy implementation.
Furthermore, I doubt your idea of storing everything as a blob in the database, use a relational database what a database has been designed for, or switch to some object oriented database like http://www.versant.com/en_US/products/objectdatabase which supports both Java and C++.
There is also Avro. Look this question for comparison of Apache thrift, protocol buffers, mes and so on.