When keeping bandwidth usage low is more important than performance, is it worth it to use binary serialization over JSON?

I have a server talking to a mobile app, and this app potentially does thousands of requests per day. I don't care that much about performance in this particular case, so saving some milliseconds isn't as big a concern as saving bandwidth - especially since I'm paying for it.
(1) What is the advantage of using JSON over binary here, when bandwidth is a much bigger deal than performance? I mean, I have read some people saying that the size difference between raw data and JSON isn't really that much - and that may well be partially true, but when you have thousands of daily requests made by hundreds of thousands of users, merely doubling the number of bytes will have a huge impact on bandwidth usage - and in the end, on the server bill.
Also, some people said that you can easily alter the JSON output format, while changing the binary serialization might be a little more complicated. Again, I agree, but shouldn't the decision involve more than that? Like, what are the odds that we're going to change our format? Will the ease of change make up for JSON's extra bandwidth?
(2) And finally, I stumbled upon this link while doing some research on this topic, and in the summary table (Ctrl + F, 'summary') it says that the JSON data size is smaller than the actual binary data? How is that even possible?
I would very much appreciate some answers to these questions.
Thank you in advance.

thousands of requests per day
that's ... not really a lot, so most approaches will usually be fine
What is the advantage of using JSON over binary here, when bandwidth is a much bigger deal than performance?
JSON wouldn't usually have an advantage; usually that would go to binary protocols - things like protobuf. However, compression may be more significant than the choice of protocol. If you want meaningful answers, the only way to get them is to test with your data.
If bandwidth is your key concern, I'd go with protobuf, adding compression on top if you have a lot of text data in your content (text compresses beautifully, and protobuf simply uses UTF8, so it is still available for compression).
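As a rough illustration, compression is easy to test in isolation. Here is a minimal sketch in Java with a made-up payload (measure against captures of your real responses); note that gzip has a fixed header overhead, so very small payloads can actually grow:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class PayloadSizeCheck {
    // Compress a payload with gzip and report the resulting size in bytes.
    static int gzippedSize(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload; run this against your actual responses.
        String json = "{\"id\":42,\"name\":\"example\",\"tags\":[\"a\",\"b\",\"c\"]}";
        byte[] raw = json.getBytes(StandardCharsets.UTF_8);
        System.out.println("raw: " + raw.length + " bytes, gzipped: " + gzippedSize(raw) + " bytes");
    }
}
```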
it says that the JSON data size is smaller than the actual binary data?
JSON contains textual field names, punctuation (quotes, colons, commas), etc. - and all values are textual rather than primitive; JSON will be larger than the output of a good binary serializer. The article, however, compares to BinaryFormatter; BinaryFormatter does not qualify as a good binary serializer IMO. It is easy to use and works without needing to do anything. If you compare against something like protobuf: protobuf will beat JSON every time. And sure enough, if we look at the table: JSON is 102 or 86 bytes (depending on the serializer); protobuf-net is 62 bytes, MsgPack is 61, BinaryFormatter is 669. Do not conflate "a binary serializer" with BinaryFormatter. I blame the article for this error, not you. MsgPack and Protocol Buffers (protobuf-net) are binary serializers, and they come out in the lead.
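To make the size gap concrete: protobuf identifies fields by small numeric tags and packs integers as base-128 varints, rather than spelling out field names and decimal digits. A minimal sketch of the varint idea (an illustration of the wire format, not protobuf-net's actual code):

```java
import java.io.ByteArrayOutputStream;

public class Varint {
    // Encode a non-negative value as a base-128 varint: 7 bits per byte,
    // high bit set on every byte except the last.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 300 fits in 2 bytes here; in JSON, "count":300 alone costs 11 bytes.
        System.out.println(encode(300).length); // 2
    }
}
```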
(disclosure: I'm the author of protobuf-net)

Related

x264/x265 options for fast decoding while preserving quality

I want to encode some videos in H264 and H265, and I would like to ask for help in order to choose the adequate options in x264/x265.
My first priority is video quality. Lossless would be the best quality, for example
My second priority is fast decoding speed (i.e., I would like faster decoding without sacrificing quality)
fast decoding for me means that the decoding machine would use fewer CPU resources reading the resulting video
Less RAM consumption would be a plus
Latency is not important
I don't care that much if the output file ends up bigger. Also, encoding speed is not important
I'm aware of the option -tune fastdecode in both x264 and x265. But apparently the quality gets a bit worse using it.
For x264:
-tune fastdecode is equivalent to --no-cabac --no-deblock --no-weightb --weightp 0 (My source is x264 --fullhelp)
Which options preserve quality?
For x265:
-tune fastdecode is equivalent to --no-deblock --no-sao --no-weightp --no-weightb --no-b-intra (according to x265 doc)
Again, which options preserve quality?
I tried to read some documentation on these options, but I'm afraid I'm too stupid to understand if they preserve quality or not.
To explain further what I mean by "preserving quality":
I don't understand what the cabac option does exactly. But let's say for example that it adds some extra lossless compression, resulting in a smaller video file, but the decoding machine would need to do extra work to decompress the file
In that case, the --no-cabac option would skip that extra compression, resulting in no loss of quality, with a bigger file size, and the decoding machine would not need to decompress the extra compression, saving CPU work on the decoding side
In this scenario, I would like to add the --no-cabac option, as it speeds up decoding, while preserving quality.
I hope I've managed to get my point across
Can anyone help me pick the right options?
Thanks in advance
As @O.Jones alluded to, "it depends" is the best answer you're going to get. As the author of Compressarr, I can tell you that figuring out the answer to that question, per video, takes hours.
Basically, encode the video (using best-guess values), then check it for SSIM against the original. Decide what percentage of similarity is acceptable (above 98% is supposed to be an imperceptible difference). Check your file size and repeat with new options as necessary. You can get away with using samples, so you don't have to do the whole thing, but then the final SSIM is only a guess.
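For what it's worth, ffmpeg's ssim filter can do the comparison; here's a rough sketch that shells out to it from Java (file names are placeholders, and ffmpeg must be on your PATH):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SsimCheck {
    public static void main(String[] args) throws Exception {
        // Compare the encode against the original; the ssim filter prints a
        // summary line where "All:" close to 1.0 means nearly identical.
        ProcessBuilder pb = new ProcessBuilder(
                "ffmpeg", "-i", "encoded.mkv", "-i", "original.mkv",
                "-lavfi", "ssim", "-f", "null", "-");
        pb.redirectErrorStream(true); // ffmpeg logs the score to stderr
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.contains("SSIM")) System.out.println(line);
            }
        }
        p.waitFor();
    }
}
```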
Or just use Compressarr and let it automate all that for you. :D

Considerations for binary serializations (Protobuf, CBOR, MessagePack, etc.) for a long-term archive data format

In discussions for a next-generation scientific data format, a need for some kind of JSON-like data structures (logical grouping of fields) has been identified. Additionally, it would be preferable to leverage an existing encoding instead of using a custom binary structure. For serialization formats there are many options. Any guidance or insight from those who have experience with these kinds of encodings is appreciated.
Requirements: In our format, data need to be packed in records, normally no bigger than 4096 bytes. Each record must be independently usable. The data must be readable for decades to come. Data archiving and exchange is done by storing and transmitting a sequence of records. Data corruption must only affect the corrupted records, leaving all others in the file/stream/object readable.
Priorities (roughly in order) are:
stability, long term archive usage
performance, mostly read
ability to store opaque blobs
size
simplicity
broad software (aka library) support
stream-ability, transmitted and readable as a record is generated (if possible)
We have started to look at Protobuf (Protocol Buffers RFC), CBOR (RFC) and a bit at MessagePack.
Any information from those with experience that would help us determine the best fit or, more importantly, avoid pitfalls and dead-ends, would be greatly appreciated.
Thanks in advance!
Late answer, but: you may want to decide whether you want a schema-based or self-describing format. The Amazon Ion overview talks about some of the pros and cons of these design decisions, as does this other ION (completely different from Amazon Ion).
Neither of those fully meets your criteria, but these articles should list a few criteria you might want to consider. Obviously, actually being a standard and being adopted are far higher guarantees of longevity than any technical design criteria.
Your goal of recovery from data corruption is almost certainly something that should be addressed in a separate architectural layer from the encoding of the records. How many records to pack into a blob/file/stream is really more related to how many records you can afford to read through sequentially before finding the one you need.
An optimal solution to storage corruption depends on what kind of corruption you consider likely. For example, if you store data on spinning disks your best protection might be different from if you store data on tape. But the details of that are really not an application-level concern. It's better to abstract/outsource that sort of concern.
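As an illustration of keeping that concern in its own layer, here is a minimal sketch of length-prefixed, individually checksummed records, so a corrupted record can be detected and skipped without losing the rest. The length+CRC32 framing is an assumption for illustration, not part of Protobuf, CBOR, or MessagePack:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

public class RecordFraming {
    // Write one record: length, checksum, then the encoded payload.
    static void writeRecord(DataOutputStream out, byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);
        out.writeInt(payload.length);  // record length
        out.writeLong(crc.getValue()); // checksum over the payload
        out.write(payload);            // encoded record (protobuf, CBOR, ...)
    }

    // Read one record, verifying its checksum before returning it.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int len = in.readInt();
        long expected = in.readLong();
        byte[] payload = new byte[len];
        in.readFully(payload);
        CRC32 crc = new CRC32();
        crc.update(payload);
        if (crc.getValue() != expected) {
            throw new IOException("corrupt record; skip and resync");
        }
        return payload;
    }
}
```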
Modern cloud-based data storage services provide extremely robust protection against corruption, measured in the industry as "durability". For example, even Microsoft Azure's lowest-cost storage option, Locally Redundant Storage (LRS), stores at least three separate copies of any data received, and maintains at least that level of protection for as long as you want. If any copy gets damaged, another is made from one of the undamaged ones ASAP. That results in an annual durability of 11 nines (99.999999999%), and that's the "low-cost" option at Microsoft. The normal redundancy plan, Geo-Redundant Storage (GRS), offers durability exceeding 16 nines. See Azure Storage redundancy.
According to Wasabi, eleven-nines durability means that if you have 1 million files stored, you might lose one file every 659,000 years. You are about 411 times more likely to get hit by a meteor than to lose a file.
P.S. I previously worked on the Microsoft Azure Storage team, so that's the service I know best. However, I trust that other cloud-storage options (e.g., Wasabi and Amazon S3) offer similar durability protection; for example, Amazon S3 Standard and Wasabi hot storage are like Azure LRS, with eleven nines of durability. If you are not worried about a meteor strike, you can rest assured that these services won't lose your data anytime soon.

Really streaming large MySQL result sets

There's a similar question about streaming large results but the answer just points at docs and no clear answer emerges.
I believe that merely treating a full result set as a stream still takes a lot of memory on the JDBC driver side.
I am wondering if there's any clear cut pattern, or best practice, for making it work, especially on the jdbc driver side.
And in particular, I am not sure that setFetchSize(Integer.MIN_VALUE) is a very good idea, as it seems far from optimal if it means each row is sent on its own on the wire.
I believe libraries like jooq and slick already take care of that... and am curious as to how to accomplish it with and without them.
Thanks!
I am wondering if there's any clear cut pattern, or best practice, for making it work, especially on the jdbc driver side.
The best practice is not to do synchronous streaming but rather to fetch in moderately sized chunks. However, avoid using OFFSET (also see). If you're doing a batch process, this can be facilitated by first selecting and pushing the data into a temporary table (i.e., turn the original results you want into a table first, and then select chunks from that table... databases are really fast at copying data internally).
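A minimal sketch of the chunked approach using keyset pagination (the table and column names here are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ChunkedFetch {
    // Fetch in chunks ordered by an indexed key, resuming from the last key
    // seen, instead of paging with OFFSET (which re-scans the skipped rows).
    static void processInChunks(Connection conn, int chunkSize) throws SQLException {
        long lastId = 0;
        while (true) {
            int rows = 0;
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?")) {
                ps.setLong(1, lastId);
                ps.setInt(2, chunkSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");
                        // ... process the row here ...
                        rows++;
                    }
                }
            }
            if (rows < chunkSize) break; // final (partial) chunk reached
        }
    }
}
```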
Synchronous streaming (i.e., the iterator approach) in general does not scale. It does not scale well for batch processing, and it certainly does not scale for handling loads of clients. This is why the drivers vary and do so many different things: it's fairly difficult to figure out how many resources to load, because it is a pull model. Async streaming (a push model) would probably help, but unfortunately the JDBC standard does not support async streaming.
You might notice that this is one of the reasons why many of the wrappers around JDBC, such as Spring JDBC, do not return Iterators (along with the fact that the resource also needs to be manually cleaned up). Some of the wrappers provide iterators, but really they just turn the results into a list.
It's rather disturbing that your link to the Scala version is upvoted, given the stateful nature of managing a ResultSet... it's very un-Scala-like... I'm not sure those folks know they have to consume the iterator or close the connection/ResultSet properly, which requires a fair amount of imperative programming.
While it may seem inefficient to let the database decide how much to buffer, just remember that most database connections are extremely heavy memory-wise (at least on Postgres they are). So if you take a long time streaming and have many clients, you're going to have to create more connections and put a serious burden on the database. Not to mention that the default buffers have probably been highly optimized (i.e., the result set size that the client ends up with).
Finally, for batch processing, chunks can be done in parallel, which is obviously more efficient than a synchronous pipeline, and can be restarted (without having to rework already-processed data) if a problem occurs.

Are cryptographic hashes injective under certain conditions?

Sorry for the lengthy post; I have a question about common cryptographic hashing algorithms, such as the SHA family, MD5, etc.
In general, such a hash algorithm cannot be injective, since the digest produced usually has a fixed length (e.g., 160 bits under SHA-1) whereas the space of possible messages to be digested is virtually infinite.
However, if we generate a digest of a message which is at most as long as the digest generated, what are the properties of the commonly used hashing algorithms? Are they likely to be injective on this limited message space? Are there algorithms which are known to produce collisions even on messages whose bit length is shorter than the bit length of the digest produced?
I am actually looking for an algorithm, which has this property, i.e., which, at least in principle, may generate colliding hashes even for short input messages.
Background: We have a browser plug-in which, for every website visited, makes a server request asking whether the website belongs to one of our known partners. But of course, we don't want to spy on our users. So, in order to make it hard to reconstruct some kind of surf history, we do not actually send the URL visited, but a hash digest (currently, SHA-1) of some cleaned-up version. On the server side, we have a table of hashes of well-known URIs which is matched against the received hash. We can live with a certain amount of uncertainty here, since we consider not being able to track our users a feature, not a bug.
For obvious reasons, this scheme is pretty fuzzy and admits false positives, as well as false negatives (URIs that should have matched but did not).
So right now, we are considering changing the fingerprint generation to something which has more structure. For example, instead of hashing the full (cleaned-up) URI, we might:
split the host name into components at "." and hash those individually
split the path into components at "/" and hash those individually
join the resulting hashes into a fingerprint value. Example: hashing "www.apple.com/de/shop" using this scheme (and using Adler-32 as the hash) might yield "46989670.104268307.41353536/19857610/73204162".
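A minimal sketch of that scheme using java.util.zip.Adler32 (canonicalization is omitted, and it won't necessarily reproduce the exact numbers in the example above):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;
import java.util.zip.Adler32;

public class UrlFingerprint {
    // Hash one host or path component with Adler-32.
    static String adler(String component) {
        Adler32 a = new Adler32();
        a.update(component.getBytes(StandardCharsets.UTF_8));
        return Long.toString(a.getValue());
    }

    // Build the "host-hashes/path-hashes" fingerprint described above.
    static String fingerprint(String host, String path) {
        String hostPart = Arrays.stream(host.split("\\."))
                .map(UrlFingerprint::adler)
                .collect(Collectors.joining("."));
        String pathPart = Arrays.stream(path.split("/"))
                .filter(s -> !s.isEmpty())
                .map(UrlFingerprint::adler)
                .collect(Collectors.joining("/"));
        return hostPart + "/" + pathPart;
    }

    public static void main(String[] args) {
        System.out.println(fingerprint("www.apple.com", "de/shop"));
    }
}
```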
However, as such a fingerprint has a lot of structure (in particular, when compared to a plain SHA-1 digest), we might accidentally make it pretty easy again to compute the actual URI visited by a user (for example, by using a pre-computed table of hash values for "common" component values, such as "www").
So right now, I am looking for a hash/digest algorithm which has a high rate of collisions (Adler-32 is being seriously considered) even on short messages, so that the probability of a given component hash being unique is low. We hope that the additional structure we impose provides us with enough extra information to improve the matching behaviour (i.e., lower the rate of false positives/false negatives).
I do not believe hashes are guaranteed to be injective for messages the same size as the digest. If they were, they would be bijective, which would miss the point of a hash. This suggests that they are not injective for messages smaller than the digest either.
If you want to encourage collisions, I suggest you use any hash function you like, then throw away bits until it collides enough.
For example, throwing away 159 bits of a SHA-1 hash will give you a pretty high collision rate. You might not want to throw that many away.
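A minimal sketch of that bit-throwing approach (the choice of SHA-1 and of how many bits to keep are just placeholders):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TruncatedHash {
    // Hash the input with SHA-1, then keep only the first `bits` bits
    // (1..64) to force collisions: keeping 16 bits leaves 65536 buckets.
    static long truncatedSha1(String input, int bits) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        long value = 0;
        for (int i = 0; i < 8; i++) {          // take the first 8 digest bytes
            value = (value << 8) | (digest[i] & 0xFF);
        }
        return value >>> (64 - bits);          // keep only the top `bits` bits
    }
}
```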
However, what you are trying to achieve seems inherently dubious. You want to be able to tell that the URL is one of yours, but not which one it is. That means you want your URLs to collide with each other, but not with URLs that are not yours. A hash function won't do that for you. Rather, because collisions will be random, and since there are many more URLs which are not yours than which are (I assume!), any given level of collision will lead to dramatically more confusion over whether a URL is one of yours or not than over which of yours it is.
Instead, how about sending the list of URLs to the plugin at startup, and then having it just send back a single bit indicating if it's visiting a URL in the list? If you don't want to send the URLs explicitly, send hashes (without trying to maximise collisions). If you want to save space, send a Bloom filter.
Since you're willing to accept a rate of false positives (that is, random sites identified as whitelisted when in fact they are not), a Bloom filter might be just the thing.
Each client downloads a Bloom filter containing the whole whitelist. Then the client has no need to otherwise communicate with the server, and there is no risk of spying.
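A minimal sketch using Guava's BloomFilter (the sizing numbers are illustrative; Guava derives the bit count from the expected insertions and target false-positive rate):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class WhitelistFilter {
    public static void main(String[] args) {
        // Server side: build the filter from the whitelist and ship it to clients
        // (BloomFilter also has writeTo/readFrom for serialization).
        BloomFilter<String> whitelist = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000,   // expected number of whitelisted URLs
                0.000001);   // target false-positive probability
        whitelist.put("www.apple.com/de/shop"); // canonicalized URL

        // Client side: the membership test is local, so nothing is sent to the server.
        System.out.println(whitelist.mightContain("www.apple.com/de/shop")); // true
        System.out.println(whitelist.mightContain("www.example.com/other")); // almost certainly false
    }
}
```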
At 2 bytes per URL, the false positive rate would be below 0.1%, and at 4 bytes per URL below 1 in 4 million.
Downloading the whole filter (and perhaps regular updates to it) is a large investment of bandwidth up front. But supposing that it has a million URLs on it (which seems quite a lot to me, given that you can probably apply some rules to canonicalize URLs before lookup), it's a 4MB download. Compare this with a list of a million 32 bit hashes: same size, but the false positive rate would be somewhere around 1 in 4 thousand, so the Bloom filter wins for compactness.
I don't know how the plugin contacts the server, but I doubt that you can get an HTTP transaction done in much under 1kB -- perhaps less with keep-alive connections. If filter updates are less frequent than one per 4k URL visits by a given user (or a smaller number if there are less than a million URLs, or greater than 1 in 4 million false positive probability), this has a chance of using less bandwidth than the current scheme, and of course leaks much less information about the user.
It doesn't work quite as well if you require new URLs to be whitelisted immediately, although I suppose the client could still hit the server at every page request, to check whether the filter has changed and if so download an update patch.
Even if the Bloom filter is too big to download in full (perhaps for cases where the client has no persistent storage and RAM is limited), then I think you could still introduce some information-hiding by having the client compute which bits of the Bloom filter it needs to see, and asking for those from the server. With a combination of caching in the client (the higher the proportion of the filter you have cached, the fewer bits you need to ask for and hence the less you tell the server), asking for a window around the actual bit you care about (so you don't tell the server exactly which bit you need), and the client asking for spurious bits it doesn't really need (hide the information in noise), I expect you could obfuscate what URLs you're visiting. It would take some analysis to prove how much that actually works, though: a spy would be aiming to find a pattern in your requests that's correlated with browsing a particular site.
I'm under the impression that you actually want public-key cryptography, where you provide the visitor with a public key used to encrypt the URL, and you decrypt the URL using the secret key.
There are JavaScript implementations all over the place.

Does anyone know how I can store large binary values in Riak?

For now, they don't recommend storing files larger than 50MB in size without splitting them. See: FAQ - Riak Wiki
If your files are smaller than 50MB, then proceed as you would with storing non-binary data in Riak.
Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of “document database” (semi-structured, mostly de-normalized data) and “attachment storage” will have different needs than the key/value-style scheme — namely, the need for efficient online queries, conflict resolution, increased internal semantics, and robust expressions of relationships. (Schema Design in Riak - Introduction)
@Brian Mansell's answer is on the right track - you don't really want to store large binary values (over 50 MB) as a single object in Riak (the cluster becomes unusably slow after a while).
You have 2 options, instead:
1) If a binary object is small enough, store it directly. If it's over a certain threshold (50 MB is a decent arbitrary value to start with, but really, run some performance tests to see what the average object size is for your cluster, beyond which it starts to crawl), break up the file into several chunks and store the chunks separately. (In fact, most people that I've seen go this route use chunks of 1 MB in size.)
This means, of course, that you have to keep track of the "manifest" -- which chunks got stored where, and in what order. And then, to retrieve the file, you would first have to fetch the object tracking the chunks, then fetch the individual file chunks and reassemble them back into the original file. Take a look at a project like https://github.com/podados/python-riakfs to see how they did it.
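A minimal sketch of the chunk-and-manifest idea (the key scheme and the store() call are hypothetical stand-ins for whichever Riak client you use):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    static final int CHUNK_SIZE = 1024 * 1024; // 1 MB chunks

    // Split a file into 1 MB chunks, store each under its own key, and
    // return the ordered list of chunk keys to save as the manifest.
    static List<String> storeInChunks(Path file) throws IOException {
        List<String> manifest = new ArrayList<>();
        try (InputStream in = Files.newInputStream(file)) {
            int index = 0;
            while (true) {
                byte[] chunk = in.readNBytes(CHUNK_SIZE); // full 1 MB until EOF
                if (chunk.length == 0) break;
                String key = file.getFileName() + ":chunk:" + index++;
                store(key, chunk); // write one chunk to Riak
                manifest.add(key);
            }
        }
        return manifest; // store this under the file's own key, to reassemble later
    }

    static void store(String key, byte[] value) {
        /* call your Riak client here */
    }
}
```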
2) Alternatively, you can just use Riak CS (Riak Cloud Storage), to do all of the above, but the code is written for you. That's exactly how RiakCS works -- it breaks an incoming file into chunks, stores and tracks them individually in plain Riak, and reassembles them when it comes time to fetch it back. And provides an Amazon S3 API for file storage, for your convenience. I highly recommend this route (so as not to reinvent the wheel -- chunking and tracking files is hard enough). Yes, CS is a paid product, but check out the free Developer Trial, if you're curious.
Just like every other value. Why would it be different?
Use either the Erlang interface ( http://hg.basho.com/riak/src/461421125af9/doc/basic-client.txt ) or the "raw" HTTP interface ( http://hg.basho.com/riak/src/tip/doc/raw-http-howto.txt ). It should "just work."
Also, you'll generally find a better response on the riak-users mailing list than you will here. http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com (No offense to z8000, who seems to also have answers.)