It's known (see this answer here) that Couchbase provides binary data as a base64-encoded document when using MapReduce queries.
However, does it store it as base64 too? From libcouchbase's perspective, it takes a byte array + length; does it get converted to base64 later?
The Couchbase storage engine stores your data exactly as-is (i.e. the stream of bytes of the length you specify) internally. When reading that data back using the CRUD key/value API, you get the exact same stream of bytes at the protocol level.
This is possible because the low-level key-value protocol is binary on the wire, and so there are no issues with using all 8 bits per byte.
Different Client SDKs will expose that in different ways to you. For example:
The C SDK (being low-level) directly gives you back a char* buffer and length.
The Python SDK provides a transcoding feature where it uses a flag in the document's metadata to encode the type of the document, so it can automatically convert it back to the original type, for example a Python serialised object or a JSON object.
On the other hand, the Views API is done over HTTP with JSON response objects. JSON cannot directly encode 8-bit binary data, so Couchbase needs to use base64 encoding for the view response objects if they contain binary data.
(As an aside, this is one of the reasons why it is recommended to have a view emit the minimum amount of data needed, for example just the key of the document(s) of interest, and then use the CRUD key/value interface to actually fetch the document: the key/value interface doesn't have the base64 overhead when transmitting data back.)
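The size cost of base64 is easy to quantify: every 3 input bytes become 4 output characters (plus padding), so view responses carrying binary grow by roughly a third. A minimal sketch, not Couchbase-specific code:

#include <stddef.h>

/* number of base64 characters needed to encode n bytes (excluding a trailing NUL) */
size_t base64_len(size_t n)
{
    return 4 * ((n + 2) / 3);   /* e.g. a 9-byte value becomes 12 characters */
}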
I've noticed a lot of payload data encoded as Base64 before transmission in many IoT use cases, particularly in LPWAN (LoRa, LTE-M, NB-IoT, Sigfox, etc.).
For simplicity's sake, sending JSON payloads makes a lot of sense. It's also my understanding that Base64 encoding adds some additional weight to the payload size, which seems counterintuitive for low-bandwidth use cases.
Could someone explain the benefits of using Base64 in IoT (or otherwise) applications?
Thanks!
Well, base64 is generally used to encode binary formats. Since binary is the native representation of data in a computer, it is obviously the easiest format for resource-constrained embedded devices to handle. It's also reasonably compact.
As an algorithm, base64 is fairly simple conceptually and requires very few resources to implement, so it's a good compromise for squeezing binary data through a text-only channel. Building a JSON record, on the other hand, typically requires a JSON library which consumes RAM and code space - not horribly much, but still more than base64.
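To give a feel for how little code that is, here is a minimal sketch of a base64 encoder in C (assuming the caller supplies an output buffer of at least 4 * ((len + 2) / 3) + 1 bytes); it needs no dynamic memory at all:

#include <stddef.h>
#include <stdint.h>

static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void base64_encode(const uint8_t *in, size_t len, char *out)
{
    size_t i, o = 0;
    for (i = 0; i + 2 < len; i += 3) {           /* full 3-byte groups */
        uint32_t n = ((uint32_t)in[i] << 16) | ((uint32_t)in[i + 1] << 8) | in[i + 2];
        out[o++] = b64_alphabet[(n >> 18) & 0x3F];
        out[o++] = b64_alphabet[(n >> 12) & 0x3F];
        out[o++] = b64_alphabet[(n >> 6)  & 0x3F];
        out[o++] = b64_alphabet[n & 0x3F];
    }
    if (len - i == 1) {                          /* 1 trailing byte -> "XX==" */
        uint32_t n = (uint32_t)in[i] << 16;
        out[o++] = b64_alphabet[(n >> 18) & 0x3F];
        out[o++] = b64_alphabet[(n >> 12) & 0x3F];
        out[o++] = '=';
        out[o++] = '=';
    } else if (len - i == 2) {                   /* 2 trailing bytes -> "XXX=" */
        uint32_t n = ((uint32_t)in[i] << 16) | ((uint32_t)in[i + 1] << 8);
        out[o++] = b64_alphabet[(n >> 18) & 0x3F];
        out[o++] = b64_alphabet[(n >> 12) & 0x3F];
        out[o++] = b64_alphabet[(n >> 6)  & 0x3F];
        out[o++] = '=';
    }
    out[o] = '\0';
}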
Not to mention that the data channels you've mentioned are rather starved for bandwidth. E.g. public LoRaWAN deployments are notorious for permitting a device to send a few dozen bytes of data a few dozen times per day.
If I want to encode a data record consisting of, say a 32-bit timestamp, an 8-bit code specifying the type of data (i.e. temperature, voltage or pressure) and 32-bit data sample:
struct __attribute__((packed)) {  /* packed so the record really is 9 bytes */
    uint32_t time;   /* 32-bit timestamp */
    uint8_t  type;   /* temperature, voltage or pressure */
    uint32_t value;  /* 32-bit data sample */
};
This will use 9 bytes. It grows to 12 bytes after being encoded with base64.
Compare that with a simple JSON record, which is 63 bytes even after leaving out all whitespace:
{
"time": "2012-04-23T18:25:43.511Z",
"type": "temp",
"value": 26.94
}
So 12 B versus 63 B - not much of a contest for bandwidth-starved data channels. On a LoRaWAN link that could make the difference between squeezing 5-6 data records into your precious uplink slot or just 1.
Regarding data compression: on a resource-constrained embedded device it's much, much more practical to encode data as compact binary in the first place than to transform it into a verbose format and then compress that.
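As a sketch of what that packing looks like in practice (the function name and the choice of little-endian byte order are mine, not mandated by any particular protocol), the 9-byte record above can be serialised with a handful of shifts:

#include <stdint.h>

/* Serialise the record as 9 bytes: 32-bit timestamp, 8-bit type, 32-bit value.
   Multi-byte fields are written explicitly as little-endian so the wire format
   does not depend on the MCU's native byte order. */
void pack_record(uint8_t out[9], uint32_t time, uint8_t type, uint32_t value)
{
    out[0] = (uint8_t)(time);
    out[1] = (uint8_t)(time >> 8);
    out[2] = (uint8_t)(time >> 16);
    out[3] = (uint8_t)(time >> 24);
    out[4] = type;
    out[5] = (uint8_t)(value);
    out[6] = (uint8_t)(value >> 8);
    out[7] = (uint8_t)(value >> 16);
    out[8] = (uint8_t)(value >> 24);
}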
I recently started reading about and employing gRPC in my work. gRPC uses protocol buffers internally as its IDL, and I keep reading everywhere that protocol buffers perform much better and faster than JSON and XML.
What I fail to understand is - how do they do that? What design in protocol-buffers actually makes them perform faster compared to XML and JSON?
String representations of data:
require text encode/decode (which can be cheap, but is still an extra step)
require complex parse code, especially if there are human-friendly rules like "must allow whitespace"
usually involve more bandwidth - so more actual payload to churn - due to embedding of things like names, and (again) having to deal with human-friendly representations (how to tokenize the syntax, for example)
often require lots of intermediate string instances that are used for member lookups etc.
Both text-based and binary-based serializers can be fast and efficient (or slow and horrible)... it's just that binary serializers have the scales tipped in their favour. This means that a "good" binary serializer will usually be faster than a "good" text-based serializer.
Let's compare a basic example of an integer:
json:
{"id":42}
9 bytes if we assume ASCII or UTF-8 encoding and no whitespace.
xml:
<id>42</id>
11 bytes if we assume ASCII or UTF-8 encoding and no whitespace - and no namespace noise.
protobuf:
0x08 0x2a
2 bytes
Now imagine writing a general purpose xml or json parser, and all the ambiguities and scenarios you need to handle just at the text layer, then you need to map the text token "id" to a member, then you need to do an integer parse on "42". In protobuf, the payload is smaller, plus the math is simple, and the member-lookup is an integer (so: suitable for a very fast switch/jump).
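To make that concrete, here is a sketch (plain C, not an actual protobuf library) of what decoding those two bytes amounts to; the 0x08 tag byte carries the field number and wire type, and 0x2a is a single-byte varint:

#include <stdio.h>

int main(void)
{
    unsigned char buf[] = { 0x08, 0x2a };   /* the 2-byte protobuf payload for id=42 */

    unsigned field = buf[0] >> 3;           /* field number 1 ("id" in the schema)   */
    unsigned wire  = buf[0] & 0x07;         /* wire type 0 = varint                  */
    unsigned value = buf[1];                /* 0x2a has its high bit clear, so the   */
                                            /* varint is just this one byte: 42      */
    printf("field=%u wire=%u value=%u\n", field, wire, value);
    return 0;
}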
While binary protocols have an advantage in theory, in practice they can lose in performance to JSON or other protocols with a textual representation, depending on the implementation.
Efficient JSON parsers like RapidJSON or jsoniter-scala parse most JSON samples at a speed of 2-8 cycles per byte. They serialize even more efficiently, except in some edge cases such as floating-point numbers, where serialization speed can drop to 16-32 cycles per byte.
But for most domains which don't have a lot of floats or doubles, their speed is quite competitive with the best binary serializers. See the results of benchmarks where jsoniter-scala parses and serializes on par with Java and Scala libraries for ProtoBuf:
https://github.com/dkomanov/scala-serialization/pull/8
I'd have to argue that binary protocols will typically win in performance over text-based protocols. You won't find many (or any) video-streaming applications using JSON to represent the frame data. That said, any poorly designed data structure will struggle when being parsed. I've worked on many communications projects where text-based protocols were replaced with binary protocols.
I'm using NVENC SDK to encode OpenGL frames and stream them over RTSP. NVENC gives me encoded data in the form of several NAL units. In order to stream them with Live555 I need to find the start code (0x00 0x00 0x01) and remove it. I want to avoid this operation.
NVENC has a sliceOffset attribute which I can consult, but it indicates slices, not NAL units. It only points to the end of the SPS and PPS headers, where the actual data starts. I understand that a slice is not the same as a NAL unit (correct me if I'm wrong). I'm already forcing single slices for the encoded data.
Is any of the following possible?
Force NVENC to encode individual NAL units
Force NVENC to indicate where the NAL units in each encoded data block are
Make Live555 accept the sequence parameters for streaming
There seems to be a point where every person trying to do H.264 over RTSP/RTP comes down to this question. Well here are my two cents:
1) There is a concept of an access unit. An access unit is a set of NAL units (it may well be only one) that represent an encoded frame. That is the level of logic you should work at. If you say you want the encoder to give you individual NAL units, then what behaviour do you expect when the encoding procedure results in multiple NAL units from one raw frame (e.g. SPS + PPS + coded picture)? That being said, there are ways to configure the encoder to reduce the number of NAL units in an access unit (such as not including the AUD NAL, not repeating SPS/PPS, and excluding SEI NALs) - with that knowledge you can know what to expect and more or less force the encoder to give you a single NAL per frame (of course this will not work for all frames, but with the knowledge you have about the encoder you can handle that). I'm not an expert on the NVENC API - I've also just started using it - but at least with Intel Quick Sync, turning off AUD and SEI and disabling repetition of SPS/PPS gave me roughly 1 NAL per frame for frames 2...N.
2) I won't be able to answer this since, as I mentioned, I'm not familiar with the API, but I highly doubt it.
3) SPS and PPS should be in the first access unit (the first bit-stream you get from the encoder); you could just find the right NAL units in the bit-stream and extract them, or there may be a special API call to obtain them from the encoder.
All that being said, I don't think it is that hard to actually run through the bit-stream, parse the start codes, extract the NAL units and feed them to Live555 one by one. Of course, if the encoder offers to output the bit-stream in the AVCC format (which, unlike the Annex B start-code format, puts an interleaved length value between the NAL units so you can just jump to the next one without scanning for a prefix), then you should use it. When it is just RTP it's easy enough to implement the transport yourself - I've had bad luck with GStreamer, which did not have proper support for FU-A packetization - but in the case of RTSP the overhead of the transport infrastructure is bigger and it is reasonable to use a third-party library like Live555.
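For what it's worth, here is a sketch of that start-code scan (assuming Annex B framing with 3- or 4-byte 0x00 0x00 (0x00) 0x01 prefixes; the callback name is illustrative, not part of any API):

#include <stddef.h>
#include <stdint.h>

/* Calls handle_nal(nal, nal_len) for every NAL unit in buf, with the
   start code stripped. Handles both 3- and 4-byte Annex B start codes. */
void split_nals(const uint8_t *buf, size_t len,
                void (*handle_nal)(const uint8_t *nal, size_t nal_len))
{
    size_t i = 0, nal_start = 0;
    int in_nal = 0;

    while (i + 2 < len) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            if (in_nal) {
                /* trim the extra zero of a 4-byte start code, if present */
                size_t end = (i > nal_start && buf[i - 1] == 0) ? i - 1 : i;
                handle_nal(buf + nal_start, end - nal_start);
            }
            i += 3;          /* skip the start code itself */
            nal_start = i;
            in_nal = 1;
        } else {
            i++;
        }
    }
    if (in_nal && nal_start < len)
        handle_nal(buf + nal_start, len - nal_start);  /* final NAL unit */
}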
I'm working on a small Perl script, and I store the data using JSON.
I decode the JSON string using from_json and encode it with to_json.
To be more specific:
The data scale could be something like 100,000 items in a hash
The data is stored in a file in the disk.
So to decode it, I'll have to read it from the disk first
And my question is:
There is a huge difference in speed between the decoding and encoding processes.
The encoding process seems to be much faster than the decoding process.
I wonder what makes that difference?
Parsing is much more computationally expensive than formatting.
from_json has to parse the JSON structures and convert them into Perl data structures; to_json merely has to iterate through the data structure and "print" out each item in a formatted way.
Parsing is a complex topic that is still the focus of CS theory work. However, at the base level, parsing is a 2-step operation: you need to scan the input stream for tokens and then validate the sequence of tokens as a valid statement in the language. Encoding, on the other hand, is a single-step operation: you already know the data is valid, you simply have to convert it to its textual representation.
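A toy illustration of that asymmetry (plain C, nothing Perl- or JSON-specific): turning 42 into text is a single formatted write, while turning "42" back into a number means scanning every character, validating it, and accumulating a value:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char text[16];
    snprintf(text, sizeof text, "%d", 42);    /* encoding: one formatted write */

    int value = 0;                            /* decoding: scan and validate   */
    for (const char *p = text; *p; p++) {
        if (!isdigit((unsigned char)*p))      /* every byte must be checked    */
            return 1;
        value = value * 10 + (*p - '0');
    }
    printf("%d\n", value);
    return 0;
}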
JSON (the module) is not a parser/encoder. It's merely a front-end for JSON::XS (very fast) or JSON::PP (not so much). JSON will use JSON::XS if it's installed, but defaults to JSON::PP if it's not. You might see very different numbers depending on whether you have JSON::XS installed or not.
I could see a pure-Perl parser (like JSON::PP) having varying performance for encoding versus decoding, because the interpreter overhead makes it hard to write something optimal, but the difference should be much smaller with JSON::XS.
It might still be a bit slower to decode than to encode with JSON::XS because of all the memory blocks it has to allocate. Allocating memory is a relatively expensive operation, and it needs to be done far more times when decoding than when encoding. For example, a Perl string consists of three memory blocks (the scalar head, the scalar body and the string buffer itself). When encoding, memory is only allocated when the output buffer needs to be enlarged.
The beginners guide for oauth says the following:
Binary data is not directly handled by the OAuth specification but is assumed to be stored in an 8bit array which is not UTF-8 encoded.
I don't understand what is meant by this. How do you store binary in an 8-bit array? The Wikipedia article on bit arrays didn't help me.
An 8-bit array most likely means an array with byte-sized elements, i.e. an array of bytes, where each byte consists of 8 bits (one octet). The data region that the array encompasses is then said to be byte-addressable.
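In other words (a minimal C illustration, not anything taken from the OAuth spec itself), storing binary in an "8-bit array" just means keeping the raw octets in a byte buffer, with the length tracked separately rather than inferred from a text encoding:

#include <stddef.h>
#include <stdint.h>

/* raw octets, not UTF-8 text - values above 0x7F are perfectly fine here */
static const uint8_t payload[4]  = { 0xDE, 0xAD, 0xBE, 0xEF };
static const size_t  payload_len = sizeof payload;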