I'm working on a small Perl script, and I store the data using JSON.
I decode the JSON string using from_json and encode with to_json.
To be more specific:
The data is on the scale of roughly 100,000 items in a hash.
The data is stored in a file on disk.
So to decode it, I first have to read it from disk.
And my question is:
There is a huge difference in the speed between the decoding and encoding process.
The encoding process seems to be much faster than the decoding process.
And I wonder what makes that difference?
Parsing is much more computationally expensive than formatting.
from_json has to parse the JSON structures and convert them into Perl data structures; to_json merely has to iterate through the data structure and "print" out each item in a formatted way.
Parsing is a complex topic that is still the focus of CS theory work. At the most basic level, though, parsing is a two-step operation: you tokenize the input stream, then validate that the sequence of tokens forms a valid statement in the language. Encoding, on the other hand, is a single-step operation: you already know the data is valid, so you simply have to convert it to the output representation.
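To make that two-step structure concrete, here is a toy Python sketch (just an illustration, not a real JSON parser): tokenizing has to scan the raw text and cope with human-friendly rules like optional whitespace before any structure can even be validated, while encoding is a single pass over data that is already known to be valid.

import re

# Step 1 of parsing: tokenize the raw text (a toy scanner, not a full JSON lexer).
TOKEN_RE = re.compile(r'[{}\[\]:,]|"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?|true|false|null')

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # human-friendly rule: skip whitespace
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"bad input at offset {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens

# Step 2 would be validating/structuring the token sequence (omitted here).
print(tokenize('{"id": 42, "name": "widget"}'))
# ['{', '"id"', ':', '42', ',', '"name"', ':', '"widget"', '}']

# Encoding, by contrast, is one straight pass over data that is already valid
# (this toy version only handles numeric values).
def encode(d):
    return "{" + ",".join(f'"{k}":{v}' for k, v in d.items()) + "}"

print(encode({"id": 42}))                # {"id":42}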
JSON (the module) is not a parser/encoder. It's merely a front-end for JSON::XS (very fast) or JSON::PP (not so much). JSON will use JSON::XS if it's installed, but defaults to JSON::PP if it's not. You might see very different numbers depending on whether you have JSON::XS installed or not.
I could see a pure-Perl parser (like JSON::PP) having quite different performance for encoding and decoding, because all the interpreter overhead makes it hard to write something optimal, but the difference should be much smaller with JSON::XS.
Decoding might still be a bit slower than encoding with JSON::XS because of all the memory blocks it has to allocate. Allocating memory is relatively expensive, and it has to be done far more often when decoding than when encoding. For example, a Perl string consists of three memory blocks (the scalar head, the scalar body, and the string buffer itself). When encoding, memory is only allocated when the output buffer needs to be enlarged.
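The asymmetry is easy to reproduce yourself. The sketch below uses Python's json module purely as a stand-in (a Perl version with JSON::XS's encode_json/decode_json has the same shape) and a hash-like structure of 100,000 items as described in the question.

import json
import time

# Build a hash-like structure with roughly 100,000 items.
data = {f"key_{i}": {"id": i, "name": f"item {i}", "value": i * 0.5}
        for i in range(100_000)}

t0 = time.perf_counter()
text = json.dumps(data)          # encode: walk the structure and print it out
t1 = time.perf_counter()
back = json.loads(text)          # decode: tokenize, validate, and allocate new objects
t2 = time.perf_counter()

print(f"encode: {t1 - t0:.3f}s  decode: {t2 - t1:.3f}s  size: {len(text)} bytes")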
I've noticed that in many IoT use cases, payload data is encoded as Base64 before transmission, particularly over LPWAN (LoRa, LTE-M, NB-IoT, Sigfox, etc.).
For simplicity's sake, sending JSON payloads makes a lot of sense. It's also my understanding that Base64 encoding adds some weight to the payload size, which seems counterintuitive for low-bandwidth use cases.
Could someone explain the benefits of using Base64 in IoT (or otherwise) applications?
Thanks!
Well, base64 is generally used to encode binary formats. Since binary is the native representation of data in a computer, it is obviously the easiest format for resource-constrained embedded devices to handle. It's also reasonably compact.
As an algorithm, base64 is conceptually fairly simple and requires very few resources to implement, so it's a good compromise for squeezing binary data through a text-only channel. Building a JSON record, on the other hand, typically requires a JSON library, which consumes RAM and code space - not horribly much, but still more than base64.
Not to mention that the data channels you've mentioned are rather starved for bandwidth. E.g. public LoRaWAN deployments are notorious for permitting a device to send a few dozen bytes of data a few dozen times per day.
Say I want to encode a data record consisting of a 32-bit timestamp, an 8-bit code specifying the type of data (e.g. temperature, voltage or pressure) and a 32-bit data sample:
struct {
    uint32_t time;    /* 32-bit timestamp */
    uint8_t type;     /* data type code */
    uint32_t value;   /* data sample */
};
Packed for transmission, this record takes 9 bytes; it grows to exactly 12 bytes after being encoded with base64 (every 3 bytes of binary become 4 characters).
Compare that with a simple JSON record, which comes to 63 bytes even after leaving out all whitespace:
{
    "time": "2012-04-23T18:25:43.511Z",
    "type": "temp",
    "value": 26.94
}
So 12 bytes versus 63 bytes - not much of a contest for bandwidth-starved data channels. On a LoRaWAN link that can make the difference between squeezing 5-6 data records into your precious uplink slot and squeezing in just one.
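As a rough cross-check of those numbers, here is a small Python sketch (illustrative only: it treats the sample as hundredths of a degree so it fits in a 32-bit integer, and uses the Unix timestamp matching the ISO date above):

import base64
import json
import struct

# Pack the record described above: 32-bit timestamp, 8-bit type code, 32-bit value.
# '<' requests little-endian with no padding, so the result is exactly 4 + 1 + 4 = 9 bytes.
raw = struct.pack("<IBI", 1335205543, 1, 2694)

encoded = base64.b64encode(raw)
as_json = json.dumps(
    {"time": "2012-04-23T18:25:43.511Z", "type": "temp", "value": 26.94},
    separators=(",", ":"),                        # strip all whitespace
)

print(len(raw), len(encoded), len(as_json))       # 9 12 63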
Regarding data compression - on a resource-constrained embedded device it's much, much more practical to encode data as compact binary instead of transforming it into a verbose format and compressing that.
I have a process that logs newline-delimited JSON at large rates (50-100 MB/s) across several instances which is logrotated out and gzipped. Those logs then run through a validation step that ensures that the file is valid for my ETL workflow by doing basic checks on things like gzip truncation. The problem is that due to some yet-unresolved issue, that gzip file, in rare cases, contains corrupted/interrupted JSON blobs, for example:
{
"firstAttribute": "foo1",
"secondAttribute": "bar1",
"thirdAttribute": "ba{"firstAttribute": "foo2",
"secondAttribute": "bar2",
"thirdAttribute": "baz2"}
When these corrupted values hit the ETL flow, it brings the whole thing to a halt until I can identify exactly which of the thousands of files is causing the problem.
So far, the fastest way I've found to detect this corrupted JSON is by using jq:
if zcat "$file" | jq -e . > /dev/null 2>&1; then mv "$file" /good/; else mv "$file" /bad/; fi
Unfortunately, due to the size of the files, this check adds 1-2 seconds per file; at that rate, new files are created more quickly than they can be validated.
I'm looking for a solution that does a JSON corruption check as quickly as possible; false positives are OK and can be dealt with using the slower check, but any false negatives mean that I may as well not have checked at all. All possible solutions are welcome, in any language, as long as they can be run on a Linux instance. Otherwise, I'm going to need to figure out how to run checks in parallel, which may require more CPU than the system generating the logs in the first place.
EDITED TO ADD
There was a suggestion to try simdjson as an alternative parsing engine. I tried a very basic prototype, below, but it actually added 1-2 s per file compared with the jq method.
#include "simdjson.h"
#include <iostream>
#include <string>
using namespace simdjson;
int main(void) {
    ondemand::parser parser;
    for (std::string line; std::getline(std::cin, line);) {
        padded_string json = padded_string(line);
        ondemand::document doc = parser.iterate(json);
    }
}
Regarding the JSON library used: the actual text parsing can be the most expensive part of decoding a JSON document, but not always. Creating/allocating the JSON objects can be quite expensive too, and string decoding is especially costly due to UTF-8. Merely checking a JSON document can be a bit faster, but in most libraries/tools the parsing is not well optimized and would be rather slow even without those other overheads. simdjson is a very fast implementation that uses SIMD instructions to parse JSON documents much faster than most libraries (which generally manage somewhere between 100 MB/s and 1 GB/s). It should check errors correctly and can perform fast lazy document evaluation. It is available in C++ and Python (although the Python package is slower, mainly because of the expensive Python object construction).
However, one major issue in your case is the decompression of the gzipped files. Gzip is a pretty slow format to compress/decompress and is poorly suited to modern computers: decompressing a gzip stream cannot be done efficiently across multiple cores or with SIMD instructions. Much faster compression/decompression methods exist in newer formats like LZ4 or Zstd. If you cannot change the input format, you will likely be bound by decompression speed. If there are multiple files, you can at least decompress them in parallel; GNU parallel makes that easy, or see the sketch below.
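Here is a rough sketch of the per-file parallel approach using only the Python standard library. The directory paths are placeholders, each worker still pays the full gzip-plus-parse cost for its file (so the win comes purely from using more cores), and a faster checker such as simdjson could later be dropped into check().

import gzip
import json
import shutil
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

GOOD, BAD = Path("/good"), Path("/bad")              # placeholder destinations

def check(path):
    """Return (path, ok): ok is True only if every line decompresses and parses as JSON."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    json.loads(line)
        return path, True
    except (OSError, EOFError, ValueError, UnicodeDecodeError):
        return path, False

if __name__ == "__main__":
    files = sorted(Path("/var/log/app").glob("*.gz"))  # placeholder input directory
    with ProcessPoolExecutor() as pool:                # one worker per core by default
        for path, ok in pool.map(check, files):
            shutil.move(str(path), str(GOOD if ok else BAD))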
I recently started reading about and employing gRPC in my work. gRPC uses protocol buffers internally as its IDL, and I keep reading everywhere that protocol buffers perform much better and faster than JSON and XML.
What I fail to understand is: how do they do that? What design in protocol buffers actually makes them faster than XML and JSON?
String representations of data:
require text encode/decode (which can be cheap, but is still an extra step)
require complex parse code, especially if there are human-friendly rules like "must allow whitespace"
usually involve more bandwidth - so more actual payload to churn - due to embedding of things like names, and (again) having to deal with human-friendly representations (how to tokenize the syntax, for example)
often require lots of intermediate string instances that are used for member lookups etc.
Both text-based and binary-based serializers can be fast and efficient (or slow and horrible)... just: binary serializers have the scales tipped in their favor. This means that a "good" binary serializer will usually be faster than a "good" text-based serializer.
Let's compare a basic example of an integer:
json:
{"id":42}
9 bytes if we assume ASCII or UTF-8 encoding and no whitespace.
xml:
<id>42</id>
11 bytes if we assume ASCII or UTF-8 encoding and no whitespace - and no noise from namespace declarations.
protobuf:
0x08 0x2a
2 bytes
Now imagine writing a general-purpose XML or JSON parser, and all the ambiguities and scenarios you need to handle just at the text layer; then you need to map the text token "id" to a member, then you need to do an integer parse on "42". In protobuf, the payload is smaller, the math is simple, and the member lookup is an integer (so: suitable for a very fast switch/jump).
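To see where those two bytes come from, here is a small Python sketch (illustrative, not a real protobuf encoder) that builds the wire bytes for field 1 = 42 by hand and compares the sizes:

import json

def encode_varint(value):
    """Protobuf base-128 varint: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Field number 1, wire type 0 (varint): tag byte = (field_number << 3) | wire_type.
tag = encode_varint((1 << 3) | 0)
payload = tag + encode_varint(42)

print(" ".join(f"{b:02x}" for b in payload))                 # 08 2a  -> the 2 bytes above
print(len(json.dumps({"id": 42}, separators=(",", ":"))))    # 9
print(len("<id>42</id>"))                                    # 11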
While binary protocols have an advantage in theory, in practice they can lose in performance to JSON or other protocols with textual representations, depending on the implementation.
Efficient JSON parsers like RapidJSON or jsoniter-scala parse most JSON samples at 2-8 cycles per byte. They serialize even more efficiently, except in some edge cases such as floating-point numbers, where serialization speed can drop to 16-32 cycles per byte.
But for most domains, which don't have a lot of floats or doubles, their speed is quite competitive with the best binary serializers. See the results of benchmarks where jsoniter-scala parses and serializes on par with Java and Scala libraries for ProtoBuf:
https://github.com/dkomanov/scala-serialization/pull/8
I'd argue that binary protocols will typically win in performance over text-based protocols. Ha, you won't find many (or any) video streaming applications using JSON to represent the frame data. However, any poorly designed data structure will struggle when being parsed. I've worked on many communications projects where text-based protocols were replaced with binary protocols.
It's known (see this answer here) that Couchbase returns binary data as a base64-encoded document when using MapReduce queries.
However, does it store it as base64 too? From libcouchbase's perspective, it takes a byte array + length; does it get converted to base64 later?
The Couchbase storage engine stores your data exactly as-is (i.e. the stream of bytes of the length you specify) internally. When reading that data using the CRUD key/value API, at the protocol level you get back the exact same stream of bytes.
This is possible because the low-level key-value protocol is binary on the wire, and so there are no issues with using all 8 bits per byte.
Different Client SDKs will expose that in different ways to you. For example:
The C SDK (being low-level) directly gives you back a char* buffer and length.
The Python SDK provides a transcoding feature, where it uses a flag in the document's metadata to encode the type of the document so it can automatically convert it back to the original type - for example a Python serialised object or a JSON object.
On the other hand, the Views API is done over HTTP with JSON response objects. JSON cannot directly encode 8-bit binary data, so Couchbase needs to use base64 encoding for the view response objects if they contain binary data.
(As an aside, this is one of the reasons why it is recommended to have an index emit the minimum amount of data needed - for example, just the key of the document(s) of interest - and then use the CRUD key/value interface to actually fetch the document: the key/value interface doesn't have the base64 overhead when transmitting data back.)
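That overhead is easy to quantify: base64 turns every 3 bytes of binary into 4 ASCII characters, so a view response carrying binary data grows by roughly a third. A minimal, Couchbase-agnostic Python illustration:

import base64
import os

doc = os.urandom(300 * 1024)          # a 300 KiB binary document, stand-in for real data
encoded = base64.b64encode(doc)

print(len(doc), len(encoded), len(encoded) / len(doc))   # 307200 409600 1.333...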
Using JSON.parse is the most common way to parse a JSON string into a JavaScript object.
It is synchronous code, but does it actually block the event loop (since it's much lower-level than the user's code)?
Is there an easy way to parse JSON asynchronously? Should it matter at all for a few KB to a few hundred KB of JSON data?
A function that does not accept a callback or return a promise blocks until it returns a value.
So yes, JSON.parse blocks. Parsing JSON is a CPU-intensive task, and JS is single-threaded, so the parsing has to block the main thread at some point. Async only makes sense when waiting on another process or system (which is why disk I/O and networking are good async candidates - they have far more latency than raw CPU processing).
I'd first prove that parsing JSON is actually a bottleneck for your app before you start optimizing its parsing. I suspect it's not.
If you think that you might have a lot of heavy JSON decoding to do, consider moving it out to another process. I know it might seem obvious, but the key to using node.js successfully is in the name.
To set up another 'node' to handle a CPU-heavy task, use IPC. Simple sockets will do, but ØMQ adds a touch of radioactive magic in that it supports a variety of transports.
It might be that the overhead of connecting a socket and sending the JSON is more expensive overall, but it will certainly stop the blocking.
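The pattern described here - ship the raw JSON to another process and parse it there - looks much the same in any runtime. Below is a rough sketch in Python (standard multiprocessing instead of sockets or ØMQ), purely to illustrate the shape of the idea:

import json
from multiprocessing import Pool

def parse(raw):
    """Runs in a worker process, so heavy parsing never blocks the main process."""
    return json.loads(raw)

if __name__ == "__main__":
    payloads = ['{"id": %d, "name": "item %d"}' % (i, i) for i in range(4)]
    with Pool(processes=2) as pool:
        # Ship the raw strings to the workers over IPC; the round trip costs some
        # serialization overhead, but the main process stays free while they parse.
        pending = [pool.apply_async(parse, (raw,)) for raw in payloads]
        # ... the main process could do other work here ...
        for result in pending:
            obj = result.get()
            print(obj["id"], obj["name"])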