I recently started reading about and using gRPC in my work. gRPC uses protocol buffers internally as its IDL, and I keep reading everywhere that protocol buffers perform much better and faster than JSON and XML.
What I fail to understand is: how do they do that? What about the design of protocol buffers actually makes them faster than XML and JSON?
String representations of data:
require text encode/decode (which can be cheap, but is still an extra step)
require complex parse code, especially if there are human-friendly rules like "must allow whitespace"
usually involve more bandwidth - so more actual payload to churn - due to embedding of things like names, and (again) having to deal with human-friendly representations (how to tokenize the syntax, for example)
often require lots of intermediate string instances that are used for member lookups etc
Both text-based and binary-based serializers can be fast and efficient (or slow and horrible)... just: binary serializers have the scales tipped in their advantage. This means that a "good" binary serializer will usually be faster than a "good" text-based serializer.
Let's compare a basic example of an integer:
json:
{"id":42}
9 bytes if we assume ASCII or UTF-8 encoding and no whitespace.
xml:
<id>42</id>
11 bytes if we assume ASCII or UTF-8 encoding, no whitespace, and no namespace noise.
protobuf:
0x08 0x2a
2 bytes
Now imagine writing a general-purpose XML or JSON parser, and all the ambiguities and scenarios you need to handle just at the text layer; then you need to map the text token "id" to a member, then you need to do an integer parse on "42". In protobuf, the payload is smaller, the math is simple, and the member lookup is an integer (so: suitable for a very fast switch/jump).
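To make the arithmetic concrete, here is a rough hand-rolled sketch (not the protobuf library itself) of how that 2-byte payload comes about, assuming field number 1 and the varint wire type:

// Tag byte = (field_number << 3) | wire_type  ->  (1 << 3) | 0 = 0x08
// Values below 128 fit in a single varint byte  ->  42 = 0x2A
val fieldNumber = 1
val wireType    = 0                                  // 0 = varint
val tag         = (fieldNumber << 3) | wireType      // 0x08
val payload     = Array(tag.toByte, 42.toByte)       // 0x08 0x2A - the whole message

println(payload.map(b => f"0x${b & 0xff}%02X").mkString(" "))  // prints: 0x08 0x2A

The receiver just reads the tag byte, switches on the field number, and reads a varint - no tokenizing, no string comparison, no text-to-integer conversion.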
While binary protocols have an advantage in theory, in practice they can lose in performance to JSON or other text-based protocols, depending on the implementation.
Efficient JSON parsers like RapidJSON or jsoniter-scala parse most JSON samples at a rate of 2-8 CPU cycles per byte. They serialize even more efficiently, except in some edge cases, such as floating-point numbers, where serialization can slow down to 16-32 cycles per byte.
But for most domains that don't have a lot of floats or doubles, their speed is quite competitive with the best binary serializers. See the results of benchmarks where jsoniter-scala parses and serializes on par with Java and Scala libraries for Protobuf:
https://github.com/dkomanov/scala-serialization/pull/8
I'd argue that binary protocols will typically win in performance over text-based protocols. You won't find many (or any) video streaming applications using JSON to represent the frame data. However, any poorly designed data structure will struggle when being parsed. I've worked on many communications projects where text-based protocols were replaced with binary protocols.
I've noticed a lot of payload data encoded as Base64 before transmission in many IoT use cases, particularly in LPWAN (LoRa, LTE-M, NB-IoT, Sigfox, etc.).
For simplicity's sake, sending JSON payloads makes a lot of sense. It's also my understanding that Base64 encoding adds some additional weight to the payload size, which seems counterintuitive for low-bandwidth use cases.
Could someone explain the benefits of using Base64 in IoT (or otherwise) applications?
Thanks!
Well, base64 is generally used to encode binary formats. Since binary is the native representation of data in a computer, it is obviously the easiest format for resource-constrained embedded devices to handle. It's also reasonably compact.
As an algorithm, base64 is fairly simple conceptually and requires very few resources to implement, so it's a good compromise for squeezing binary data through a text-only channel. Building a JSON record, on the other hand, typically requires a JSON library which consumes RAM and code space - not horribly much, but still more than base64.
Not to mention that the data channels you've mentioned are rather starved for bandwidth. E.g. public LoRaWAN deployments are notorious for permitting a device to send a few dozen bytes of data a few dozen times per day.
If I want to encode a data record consisting of, say, a 32-bit timestamp, an 8-bit code specifying the type of data (e.g. temperature, voltage or pressure) and a 32-bit data sample:
struct __attribute__((packed)) {   /* packed (GCC/Clang) so there is no padding */
    uint32_t time;    /* 32-bit Unix timestamp (a fixed-width type; time_t is often 64-bit) */
    uint8_t  type;    /* measurement type code */
    uint32_t value;   /* raw data sample */
};
This will use 9 bytes (assuming the struct is packed, as above). It grows to exactly 12 bytes after being encoded with base64, since every 3 input bytes become 4 output characters.
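If it helps to see that arithmetic, here is a rough sketch of the same 9-byte record packed and base64-encoded; the timestamp value and the scaling of the temperature to an integer are just assumptions for illustration:

import java.nio.ByteBuffer
import java.util.Base64

// Pack the record: 4-byte timestamp + 1-byte type code + 4-byte value = 9 bytes
val record = ByteBuffer.allocate(9)
  .putInt(1335205543)      // 32-bit Unix timestamp
  .put(1.toByte)           // type code, e.g. 1 = temperature
  .putInt(2694)            // value, e.g. 26.94 degC scaled by 100
  .array()

val encoded = Base64.getEncoder.encodeToString(record)

println(record.length)     // 9  (raw binary)
println(encoded.length)    // 12 (base64: ceil(9 / 3) * 4 = 12 characters)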
Compare that with a simple JSON record, which comes to 63 bytes even after leaving out all whitespace:
{
"time": "2012-04-23T18:25:43.511Z",
"type": "temp",
"value": 26.94
}
So 12 B versus 63 B - not much competition for bandwidth-starved data channels. On a LoRaWAN link that could make the difference between squeezing 5-6 data records into your precious uplink slot or just 1.
Regarding data compression - on a resource constrained embedded device it's much, much more practical to encode data as compact binary instead of transforming it into a verbose format and compressing that.
Are there major benefits of selecting NIfTI over DICOM (or vice versa) as the choice of data format? I am working on 3D volumetric semantic segmentation. I will have to convert either format to a numpy array or tensor before feeding it to the network, but I am curious about the performance implications of the choice.
(This question risks being opinion-based, so trying to stick to facts.)
DICOM is a very powerful, flexible but complex format, and its strength is to provide interoperability between different hardware and software. However, DICOM is not particularly efficient for image processing and analysis. One potential drawback of DICOM is that a single volume is stored as a sequence of 2D slices, which can be cumbersome to deal with.
NIfTI is an improved version of the Analyze file format, designed to be simpler than DICOM while still retaining all the essential metadata. It has the added benefit of being able to store a volume in a single file, with a simple header followed by raw data. This makes it fast to load and process.
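To give a feel for how simple that layout is, here is a rough sketch that pulls a few fields straight out of the fixed 348-byte NIfTI-1 header of a hypothetical uncompressed volume.nii file (little-endian byte order is assumed; a robust reader would verify sizeof_hdr and the magic string first):

import java.nio.{ByteBuffer, ByteOrder}
import java.nio.file.{Files, Paths}

// Read the fixed-size NIfTI-1 header and print the volume's basic geometry.
val header = ByteBuffer
  .wrap(Files.readAllBytes(Paths.get("volume.nii")).take(348))
  .order(ByteOrder.LITTLE_ENDIAN)               // assumption: little-endian file

val nDims  = header.getShort(40).toInt          // dim[0]: number of dimensions
val dims   = (1 to nDims).map(i => header.getShort(40 + 2 * i))
val dtype  = header.getShort(70)                // datatype code
val bitpix = header.getShort(72)                // bits per voxel
val offset = header.getFloat(108)               // vox_offset: where the raw voxel data starts

println(s"dims=$dims datatype=$dtype bitpix=$bitpix data starts at byte ${offset.toInt}")

Everything after vox_offset is just the raw voxel array, which is why loading a NIfTI volume is essentially one read plus a reshape.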
There are several other medical file formats suitable for this task. You may also wish to consider NRRD, which has many features in common with NIfTI: a simple format, fast to parse and load, with flexible storage encoding for 2D, 3D and 4D data. Many tools and libraries can process NRRD files too.
So given that your primary need is efficient storage and analysis, NIfTI or NRRD would be the better choice.
It's known (see this answer here) that Couchbase provides binary data as a base64-encoded document when using MapReduce queries.
However, does it store it as base64 too? From libcouchbase's perspective, it takes a byte array plus a length - does it get converted to base64 later?
The Couchbase storage engine stores your data exactly as-is (i.e. the stream of bytes of the length you specify) internally. When reading that data using the CRUD key/value API at the protocol level, you get back the exact same stream of bytes.
This is possible because the low-level key-value protocol is binary on the wire, and so there are no issues with using all 8 bits per byte.
Different Client SDKs will expose that in different ways to you. For example:
The C SDK (being low-level) directly gives you back a char* buffer and length.
The Python SDK provides a transcoding feature, where it uses a flag in the document's metadata to encode the type of the document, so it can automatically convert it back to the original type, for example a Python serialised object or a JSON object.
On the other hand, the Views API is done over HTTP with JSON response objects. JSON cannot directly encode 8-bit binary data, so Couchbase needs to use base64 encoding for the view response objects if they contain binary data.
(As an aside, this is one of the reasons why it is recommended to have an index emit the minimum amount of data needed - for example just the key of the document(s) of interest - and then use the CRUD key/value interface to actually get the document: the key/value interface doesn't have the base64 overhead when transmitting data back.)
I'm working on a small Perl script, and I store the data using JSON.
I decode the JSON string using from_json and encode with to_json.
To be more specific:
The data scale could be something like 100,000 items in a hash.
The data is stored in a file on disk.
So to decode it, I'll have to read it from the disk first.
And my question is:
There is a huge difference in speed between the decoding and encoding processes.
Encoding seems to be much faster than decoding.
I wonder what makes that difference?
Parsing is much more computationally expensive than formatting.
from_json has to parse the JSON structures and convert them into Perl data structures; to_json merely has to iterate through the data structure and "print" out each item in a formatted way.
Parsing is a complex topic that is still the focus of CS theory work. At the base level, however, parsing is a two-step operation: you need to scan the input stream for tokens and then validate that the sequence of tokens forms a valid statement in the language. Encoding, on the other hand, is a single-step operation: you already know the data is valid, you simply have to convert it to its textual representation.
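To make that asymmetry concrete, here is a toy sketch (in Scala rather than Perl, and far simpler than any real JSON module): encoding is a single recursive walk that appends text, whereas decoding the resulting text would need a tokenizer, grammar validation, and a fresh allocation for every string, number and hash it rebuilds.

// Toy one-pass encoder: one walk over the data, appending text as it goes.
def encode(value: Any): String = value match {
  case s: String    => "\"" + s + "\""   // (a real encoder would also escape characters)
  case n: Int       => n.toString
  case m: Map[_, _] => m.map { case (k, v) => "\"" + k + "\":" + encode(v) }
                        .mkString("{", ",", "}")
  case xs: Seq[_]   => xs.map(encode).mkString("[", ",", "]")
}

println(encode(Map("id" -> 42, "tags" -> Seq("a", "b"))))
// {"id":42,"tags":["a","b"]}
// A decoder for that same text cannot just walk an existing structure: it must
// scan characters into tokens, check that they form valid JSON, and allocate a
// new value for everything it reconstructs.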
JSON (the module) is not a parser/encoder. It's merely a front-end for JSON::XS (very fast) or JSON::PP (not so much). JSON will use JSON::XS if it's installed, but defaults to JSON::PP if it's not. You might see very different numbers depending on whether you have JSON::XS installed or not.
I could see a pure-Perl parser (like JSON::PP) having very different performance for encoding and decoding, because it's hard to write something optimal given all the overhead, but the difference should be much smaller with JSON::XS.
It might still be a bit slower to decode with JSON::XS because of all the memory blocks it has to allocate. Allocating memory is a relatively expensive operation, and it needs to be done far more often when decoding than when encoding. For example, a Perl string consists of three memory blocks (the scalar head, the scalar body and the string buffer itself). When encoding, memory is only allocated when the output buffer needs to be enlarged.
There are claims that Scala's type system is Turing complete. My questions are:
Is there a formal proof for this?
What would a simple computation look like in the Scala type system?
Is this of any benefit to Scala, the language? Does it make Scala more "powerful" in some way compared to languages without a Turing-complete type system?
I guess this applies to languages and type systems in general.
There is a blog post somewhere with a type-level implementation of the SKI combinator calculus, which is known to be Turing-complete.
Turing-complete type systems have basically the same benefits and drawbacks that Turing-complete languages have: you can do anything, but you can prove very little. In particular, you cannot prove that you will actually eventually do something.
One example of type-level computation is the new type-preserving collection transformers in Scala 2.8. In Scala 2.8, methods like map, filter and so on are guaranteed to return a collection of the same type that they were called on. So, if you filter a Set[Int], you get back a Set[Int], and if you map a List[String], you get back a List[whatever the return type of the anonymous function is].
Now, as you can see, map can actually transform the element type. So, what happens if the new element type cannot be represented with the original collection type? Example: a BitSet can only contain fixed-width integers. So, what happens if you have a BitSet and you map each number to its string representation?
someBitSet map { _.toString() }
The result would have to be a BitSet of Strings, but that's impossible. So, Scala chooses the most specific supertype of BitSet that can hold Strings, which in this case is Set[String].
All of this computation is going on during compile time, or more precisely during type checking time, using type-level functions. Thus, it is statically guaranteed to be type-safe, even though the types are actually computed and thus not known at design time.
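A minimal snippet showing this in action (the exact static result type varies a little between Scala versions - a plain Set[String] in older releases, a SortedSet[String] in newer ones - but in no version is it a BitSet):

import scala.collection.immutable.BitSet

val bits = BitSet(1, 2, 3)

val doubled = bits.map(_ * 2)        // still a BitSet: the element type is preserved
val strings = bits.map(_.toString)   // no longer a BitSet: falls back to a set of Strings

println(doubled.getClass)            // scala.collection.immutable.BitSet...
println(strings)                     // a set containing "1", "2", "3"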
My blog post on encoding the SKI calculus in the Scala type system shows Turing completeness.
For some simple type-level computations, there are also examples of how to encode natural numbers and addition/multiplication.
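For a flavour of what encoding the natural numbers at the type level looks like, here is a minimal Scala 2-style sketch of Peano numbers with addition computed entirely by the compiler during implicit resolution (names like Nat, Succ and Add are purely illustrative, not from any particular library):

sealed trait Nat
sealed trait Zero extends Nat
sealed trait Succ[N <: Nat] extends Nat

// Type-level addition expressed as an implicit "function" from (A, B) to Result.
trait Add[A <: Nat, B <: Nat] { type Result <: Nat }

object Add {
  type Aux[A <: Nat, B <: Nat, R <: Nat] = Add[A, B] { type Result = R }

  implicit def zeroPlusB[B <: Nat]: Aux[Zero, B, B] =
    new Add[Zero, B] { type Result = B }

  implicit def succPlusB[A <: Nat, B <: Nat, R <: Nat](
      implicit rest: Aux[A, B, R]): Aux[Succ[A], B, Succ[R]] =
    new Add[Succ[A], B] { type Result = Succ[R] }
}

object Demo {
  type One = Succ[Zero]
  type Two = Succ[One]

  implicitly[Add.Aux[One, One, Two]]     // compiles: 1 + 1 = 2, computed by the compiler
  // implicitly[Add.Aux[One, One, One]]  // would be rejected at compile time
}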
Finally, there is a great series of articles on type-level programming over on Apocalisp's blog.