The beginner's guide to OAuth says the following:
Binary data is not directly handled by
the OAuth specification but is assumed
to be stored in an 8bit array which is
not UTF-8 encoded.
I don't understand what is meant by this. How do you store binary data in an 8-bit array? The Wikipedia article on bit arrays didn't help me.
An 8-bit array most likely means an array with byte-sized elements, that is, an array of bytes, where a byte consists of 8 bits (one octet). The data region that the array encompasses is then said to be byte-addressable.
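As a rough illustration (Python is only my choice for the example; the OAuth guide does not prescribe a language), an array of bytes can hold any values from 0 to 255, including sequences that are not valid UTF-8 text. The helper name is_valid_utf8 is made up for this sketch:

```python
# A byte array: each element is one 8-bit value (0..255).
raw = bytearray([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes happen to form valid UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(raw), "elements:", list(raw))
# Arbitrary binary data is generally not valid UTF-8.
print("binary is valid UTF-8?", is_valid_utf8(bytes(raw)))
# Text that was encoded as UTF-8 is.
print("text is valid UTF-8?  ", is_valid_utf8("héllo".encode("utf-8")))
```

So "an 8-bit array which is not UTF-8 encoded" just means the raw bytes are kept as they are, without being interpreted as text.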
I've noticed a lot of payload data encoded as Base64 before transmission in many IoT use cases, particularly in LPWAN (LoRa, LTE-M, NB-IoT, Sigfox, etc.).
For simplicity's sake, sending JSON payloads makes a lot of sense. Also, it's my understanding that Base64 encoding adds some weight to the payload, which seems counterintuitive for low-bandwidth use cases.
Could someone explain the benefits of using Base64 in IoT (or otherwise) applications?
Thanks!
Well, base64 is generally used to encode binary formats. Since binary is the native representation of data in a computer, it is obviously the easiest format for resource-constrained embedded devices to handle. It's also reasonably compact.
As an algorithm, base64 is fairly simple conceptually and requires very few resources to implement, so it's a good compromise for squeezing binary data through a text-only channel. Building a JSON record, on the other hand, typically requires a JSON library that consumes RAM and code space - not horribly much, but still more than base64.
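To give a feel for how little there is to the algorithm, here is a minimal sketch of a base64 encoder. It is written in Python for readability rather than for an embedded target, and b64encode_simple is a name made up for this example. It is checked against the standard library's base64 module:

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64encode_simple(data: bytes) -> str:
    """Encode bytes as standard base64 (RFC 4648 alphabet, '=' padding)."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pack up to 3 bytes into a 24-bit integer, zero-padded on the right.
        n = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        # Split it into four 6-bit groups and map each group to a character.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Groups made up purely of padding bits are written as '='.
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)


sample = bytes(range(9))
print(b64encode_simple(sample))
print(b64encode_simple(sample) == base64.b64encode(sample).decode("ascii"))
```

A table lookup plus a few shifts and masks is all it takes, which is why it fits comfortably on small microcontrollers.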
Not to mention that the data channels you've mentioned are rather starved for bandwidth. E.g. public LoRaWAN deployments are notorious for permitting a device to send a few dozen bytes of data a few dozen times per day.
Say I want to encode a data record consisting of a 32-bit timestamp, an 8-bit code specifying the type of data (e.g. temperature, voltage or pressure) and a 32-bit data sample:
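struct {
    time_t   time;   /* timestamp, assumed to be 32 bits wide here */
    uint8_t  type;   /* kind of data: temperature, voltage or pressure */
    uint32_t value;  /* data sample */
};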
This will use 9 bytes when serialized without padding. It grows to exactly 12 bytes after being encoded with base64, since every 3 input bytes become 4 output characters.
Compare that with a simple JSON record, which is 63 bytes after leaving out all whitespace:
{
"time": "2012-04-23T18:25:43.511Z",
"type": "temp",
"value": 26.94
}
So 12 B or 63 B: not much of a contest on bandwidth-starved data channels. On a LoRaWAN link that could make the difference between squeezing 5-6 data records or 1 data record into your precious uplink slot.
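If you want to check the sizes yourself, here is a small sketch using Python's struct, base64 and json modules. The sample values, the type code 1 for "temperature" and the choice to store the reading as hundredths of a degree are assumptions made for the example, not part of any standard:

```python
import base64
import json
import struct

# Hypothetical sample values; the layout mirrors the record described above:
# 32-bit timestamp, 8-bit type code, 32-bit sample.
timestamp = 1335205543   # 2012-04-23T18:25:43Z as Unix time
type_code = 1            # assumed code for "temperature"
value = 2694             # 26.94 stored as hundredths of a degree

# "<" = little-endian with no padding, I = uint32, B = uint8.
packed = struct.pack("<IBI", timestamp, type_code, value)
encoded = base64.b64encode(packed)

json_record = json.dumps(
    {"time": "2012-04-23T18:25:43.511Z", "type": "temp", "value": 26.94},
    separators=(",", ":"),  # no whitespace
).encode("utf-8")

print("packed binary:", len(packed), "bytes")
print("base64:       ", len(encoded), "bytes")
print("minified JSON:", len(json_record), "bytes")
```

Note the "<" in the format string: without it, struct.pack would apply native alignment and pad the record, just as a C compiler would for an unpacked struct.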
Regarding data compression - on a resource constrained embedded device it's much, much more practical to encode data as compact binary instead of transforming it into a verbose format and compressing that.
I recently started reading about and using gRPC in my work. gRPC uses Protocol Buffers as its IDL, and I keep reading everywhere that Protocol Buffers perform much better and faster than JSON and XML.
What I fail to understand is: how do they do that? What in the design of Protocol Buffers actually makes them faster than XML and JSON?
String representations of data:
require text encode/decode (which can be cheap, but is still an extra step)
require complex parse code, especially if there are human-friendly rules like "must allow whitespace"
usually involve more bandwidth - so more actual payload to churn - due to embedding of things like names, and (again) having to deal with human-friendly representations (how to tokenize the syntax, for example)
often require lots of intermediate string instances that are used for member lookups etc.
Both text-based and binary-based serializers can be fast and efficient (or slow and horrible)... just: binary serializers have the scales tipped in their favor. This means that a "good" binary serializer will usually be faster than a "good" text-based serializer.
Let's compare a basic example of an integer:
json:
{"id":42}
9 bytes if we assume ASCII or UTF-8 encoding and no whitespace.
xml:
<id>42</id>
11 bytes if we assume ASCII or UTF-8 encoding and no whitespace - and no extra noise like namespaces.
protobuf:
0x08 0x2a
2 bytes
Now imagine writing a general-purpose XML or JSON parser, with all the ambiguities and scenarios you need to handle just at the text layer. Then you need to map the text token "id" to a member, and then do an integer parse on "42". In protobuf, the payload is smaller, the math is simple, and the member lookup is an integer (so: suitable for a very fast switch/jump).
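To make the two protobuf bytes less magical, here is a minimal hand-rolled sketch of how a varint field is encoded. It assumes "id" is declared as field number 1 with a varint type such as int32, and the function names are made up for the example. A real application would use generated protobuf code instead:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one varint-typed field: the tag followed by the value."""
    wire_type_varint = 0
    tag = (field_number << 3) | wire_type_varint
    return encode_varint(tag) + encode_varint(value)


payload = encode_varint_field(1, 42)  # assumes "id" is field number 1
print(payload.hex(" "))               # 08 2a
```

The first byte, 0x08, carries both the field number and the wire type, so the reader can dispatch on a small integer instead of comparing strings.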
While binary protocols have an advantage in theory, in practice they can lose in performance to JSON or other text-based protocols, depending on the implementation.
Efficient JSON parsers like RapidJSON or jsoniter-scala parse most JSON samples at a speed of 2-8 cycles per byte. They serialize even more efficiently, except in some edge cases such as floating-point numbers, where serialization speed can drop to 16-32 cycles per byte.
But for most domains that don't involve a lot of floats or doubles, their speed is quite competitive with the best binary serializers. Please see the benchmark results where jsoniter-scala parses and serializes on par with Java and Scala libraries for ProtoBuf:
https://github.com/dkomanov/scala-serialization/pull/8
I'd argue that binary protocols will typically win on performance against text-based protocols. Ha, you won't find many (or any) video streaming applications using JSON to represent the frame data. However, any poorly designed data structure will struggle when being parsed. I've worked on many communications projects where the text-based protocols were replaced with binary protocols.
It's known (see this answer here) that Couchbase provides binary data as a base64-encoded document when using MapReduce queries.
However, does it store it as base64 too? From libcouchbase's perspective, it takes a byte array + length; does it get converted to base64 later?
The Couchbase storage engine stores your data exactly as-is (i.e. the stream of bytes of the length you specify) internally. When reading that data using the CRUD key/value API at the protocol level, you get back the exact same stream of bytes.
This is possible because the low-level key-value protocol is binary on the wire, and so there are no issues with using all 8 bits per byte.
Different Client SDKs will expose that in different ways to you. For example:
The C SDK (being low-level) directly gives you back a char* buffer and length.
The Python SDK provides a Transcoding feature where it uses a flag in the document's metadata to encode the type of the document, so it can automatically convert it back to the original type, for example a Python serialised object or a JSON object.
On the other hand, the Views API is done over HTTP with JSON response objects. JSON cannot directly encode 8-bit binary data, so Couchbase needs to use base64 encoding for the view response objects if they contain binary data.
(As an aside, this is one of the reasons why it is recommended to have an index emit the minimum amount of data needed, for example just the key of the document(s) of interest, then use the CRUD key/value interface to actually get the document - the Key/Value interface doesn't have the base64 overhead when transmitting data back.)
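The JSON limitation is easy to reproduce outside Couchbase. The following sketch uses only Python's standard json and base64 modules. It is not Couchbase code and the "value" key is made up; it just illustrates why a JSON-over-HTTP response has to wrap binary values in something like base64:

```python
import base64
import json

blob = bytes([0x00, 0xFF, 0x10, 0x80])  # arbitrary binary value

# json.dumps() rejects raw bytes, so they cannot be embedded as-is.
try:
    json.dumps({"value": blob})
    direct_ok = True
except TypeError:
    direct_ok = False

# Base64 turns the bytes into plain ASCII text that JSON can carry.
row = json.dumps({"value": base64.b64encode(blob).decode("ascii")})
restored = base64.b64decode(json.loads(row)["value"])

print("raw bytes accepted by JSON:", direct_ok)
print("row:", row)
print("round trip intact:", restored == blob)
```

The encoded form is about a third larger than the original bytes, which is the overhead the aside above refers to.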
I'm working on a small Perl script, and I store the data using JSON.
I decode the JSON string using from_json and encode with to_json.
To be more specific:
The data scale could be something like 100,000 items in a hash
The data is stored in a file on disk.
So to decode it, I'll have to read it from the disk first
And my question is:
There is a huge difference in the speed between the decoding and encoding process.
The encoding process seems to be much faster than the decoding process.
And I wonder what makes that difference?
Parsing is much more computationally expensive than formatting.
from_json has to parse the json structures and convert them into perl data structures, to_json merely has to iterate through the data structure and "print" out each item in a formatted way.
Parsing is a complex topic that is still the focus of CS theory work. However, at the base level, parsing is a two-step operation: you need to scan the input stream for tokens and then validate the sequence of tokens as a valid statement in the language. Encoding, on the other hand, is a single-step operation: you already know the data is valid, you simply have to convert it to the representation.
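The asymmetry shows up even in a toy example. The sketch below handles only a tiny JSON subset (nested lists of integers) and is written in Python purely for illustration. It says nothing about how JSON::XS or JSON::PP are implemented, and all function names are made up:

```python
import re

TOKEN = re.compile(r"\s*(\[|\]|,|-?\d+)")


def tokenize(text):
    """Step 1: split the input into tokens, rejecting anything unexpected."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens):
    """Step 2: validate the token sequence and build the data structure."""
    def value(i):
        if i >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[i]
        if tok == "[":
            items, i = [], i + 1
            if i < len(tokens) and tokens[i] == "]":
                return items, i + 1
            while True:
                item, i = value(i)
                items.append(item)
                if i >= len(tokens):
                    raise ValueError("unterminated list")
                if tokens[i] == "]":
                    return items, i + 1
                if tokens[i] != ",":
                    raise ValueError("expected ',' or ']'")
                i += 1
        if tok in ("]", ","):
            raise ValueError(f"unexpected token {tok!r}")
        return int(tok), i + 1

    result, end = value(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after value")
    return result


def encode(data):
    """Encoding: a single pass over a structure already known to be valid."""
    if isinstance(data, list):
        return "[" + ",".join(encode(item) for item in data) + "]"
    return str(data)


print(parse(tokenize("[1, [2, 3], []]")))
print(encode([1, [2, 3], []]))
```

Most of the code is in the decoding half, and nearly all of it is error handling that the encoder never needs.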
JSON (the module) is not a parser/encoder. It's merely a front-end for JSON::XS (very fast) or JSON::PP (not so much). JSON will use JSON::XS if it's installed, but defaults to JSON::PP if it's not. You might see very different numbers depending on whether you have JSON::XS installed or not.
I could see a pure-Perl parser (like JSON::PP) performing quite differently for encoding and decoding, because all the overhead makes it hard to write something optimal, but the difference should be much smaller using JSON::XS.
It might still be a bit slower to decode using JSON::XS because of all the memory blocks it has to allocate. Allocating memory is a relatively expensive process, and it needs to be done far more often when decoding than when encoding. For example, a Perl string consists of three memory blocks (scalar head, scalar body and the string buffer itself). When encoding, memory is only allocated when the output buffer needs to be enlarged.
I'm working on storing a JSON snippet onto a type 31 QR Code so that I can scan it with a smartphone and parse the JSON.
I'm running into a few challenges.
A type 31 QR Code is the "densest" (for lack of a better word) code that I can get my Android device to reliably scan. This can store 2677 alphanumeric characters, factoring in 7% error correction.
What are my options for compressing my optimized/minified JSON object and encoding a QR code with it? Conceivably, how much more data can I store? Or am I even barking up the right tree?
It all depends, really.
Is Wi-Fi available? If so, put your JSON snippets on a web server and encode their URLs in the QR codes. Problem solved.
If this is for general consumption, then you need to be aware that some phones are better than others. Mine really struggled to scan a version 25 QR code. I'd consider anything higher than version 20 as unreliable.
There's little benefit in using the alphanumeric mode. It only stores uppercase letters, digits 0-9 and a handful of punctuation marks. At 5½ bits per character (11 bits per pair), its raw storage capacity in bits is almost identical to that of the corresponding binary mode (8 bits per character).
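The arithmetic behind that comparison can be sketched as follows. This is a simplification that ignores the mode indicator and character-count header, and the function name is made up; the figure of 2677 characters is taken from the question:

```python
def alphanumeric_bits(n_chars: int) -> int:
    """Data bits used by QR alphanumeric mode: 11 per pair, 6 for a leftover character."""
    return (n_chars // 2) * 11 + (n_chars % 2) * 6


chars = 2677                      # alphanumeric capacity quoted in the question
bits = alphanumeric_bits(chars)
equivalent_bytes = bits // 8      # whole bytes that fit in the same number of bits

print(chars, "alphanumeric characters occupy", bits, "bits")
print("the same space holds roughly", equivalent_bytes, "bytes in byte mode")
```

In other words, the larger character count of alphanumeric mode does not mean more information, only a more restricted alphabet.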
In a quick test, gzip -n -9 reduced a 545-byte JSON file to 219 bytes (40% of the original size). You could do a lot better than this if you stored your data in a compact binary format instead of a verbose tagged format.
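To measure this for your own data, something along these lines should do. The readings and the field names "time" and "value" are made-up sample data, and the results depend entirely on what you feed in, so treat this as a way to measure rather than as a benchmark:

```python
import gzip
import json
import struct

# Hypothetical data: 50 (timestamp, reading) pairs.
readings = [(1335205543 + 60 * i, 2694 + i % 7) for i in range(50)]

as_json = json.dumps(
    [{"time": t, "value": v} for t, v in readings],
    separators=(",", ":"),  # minified
).encode("utf-8")
# compresslevel=9 and mtime=0 roughly correspond to "gzip -n -9".
as_json_gz = gzip.compress(as_json, compresslevel=9, mtime=0)

# Compact binary: two unsigned 32-bit integers per reading, no field names.
as_binary = b"".join(struct.pack("<II", t, v) for t, v in readings)
as_binary_gz = gzip.compress(as_binary, compresslevel=9, mtime=0)

for label, blob in [("JSON", as_json), ("JSON + gzip", as_json_gz),
                    ("binary", as_binary), ("binary + gzip", as_binary_gz)]:
    print(f"{label:14} {len(blob):5} bytes")
```

Keep in mind that gzip output is binary, so it needs the QR byte mode (or another encoding step) rather than the alphanumeric mode.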
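If you're putting these QR codes out in public, you'll need to include some sort of authentication mechanism to prevent malicious code injections and other tomfoolery. Note that a plain 32-bit checksum only catches accidental corruption, since anyone can recompute it; a keyed MAC or a digital signature is needed to detect deliberate tampering.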
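One possible shape for such a mechanism is a keyed MAC appended to the payload. The sketch below uses Python's standard hmac module; the secret key, the 8-byte tag length and the function names sign and verify are assumptions made for the example. It also assumes the scanning app can keep the key secret, which is not a given for a publicly distributed app (a digital signature would avoid that problem):

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical shared secret
TAG_LEN = 8                                 # truncated tag to save QR space (assumption)


def sign(payload: bytes) -> bytes:
    """Append a truncated HMAC-SHA256 tag to the payload."""
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:TAG_LEN]
    return payload + tag


def verify(message: bytes) -> bytes:
    """Return the payload if the tag matches, otherwise raise ValueError."""
    if len(message) < TAG_LEN:
        raise ValueError("message too short")
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:TAG_LEN]
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication tag mismatch")
    return payload


message = sign(b'{"id":42}')
print(len(message), "bytes including the tag")
print(verify(message))
```

The tag costs 8 bytes of QR capacity here; a longer tag is safer but takes more room.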