Are there any compact binary representations of JSON out there? I know there is BSON, but even that webpage says "in many cases is not much more efficient than JSON. In some cases BSON uses even more space than JSON".
I'm looking for a format that's as compact as possible, preferably some kind of open standard?
You could take a look at the Universal Binary JSON specification. It won't be as compact as Smile because it doesn't do name references, but it is 100% compatible with JSON (whereas BSON and BJSON define data structures that don't exist in JSON, so there is no standard conversion to/from them).
It is also (intentionally) criminally simple to read and write with a standard format of:
[type, 1-byte char]([length, 4-byte int32])([data])
So simple data types begin with an ASCII marker code like 'I' for a 32-bit int, 'T' for true, 'Z' for null, 'S' for string and so on.
The format is by design engineered to be fast-to-read as all data structures are prefixed with their size so there is no scanning for null-terminated sequences.
For example, a string might be demarcated like this (the [] characters are just for illustration; they are not written in the format):
[S][512][this is a really long 512-byte UTF-8 string....]
You would see the 'S', switch to string processing, read the 4-byte integer that follows it (512), and know that you can grab the next 512 bytes in one chunk and decode them back to a string.
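To make that concrete, here is a minimal sketch in Go of that read path; the record layout follows the description above, and the helper name is made up (it is not a full parser):

package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readString reads one [S][int32 length][bytes] record as described above.
func readString(r *bufio.Reader) (string, error) {
	marker, err := r.ReadByte()
	if err != nil {
		return "", err
	}
	if marker != 'S' {
		return "", fmt.Errorf("expected 'S' marker, got %q", marker)
	}
	var length int32
	// The spec is big-endian, so decode the 4-byte length accordingly.
	if err := binary.Read(r, binary.BigEndian, &length); err != nil {
		return "", err
	}
	buf := make([]byte, length)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil // the payload is UTF-8 per the spec
}

func main() {
	// Write a record by hand: marker, big-endian length, then the bytes.
	var record bytes.Buffer
	record.WriteByte('S')
	binary.Write(&record, binary.BigEndian, int32(len("hello")))
	record.WriteString("hello")

	s, err := readString(bufio.NewReader(&record))
	fmt.Println(s, err) // hello <nil>
}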
Similarly, numeric values are written out without a length value to be more compact, because the type (byte, int32, int64, double) already defines the length in bytes (1, 4, 8 and 8 respectively). There is also support for arbitrarily long numbers that is extremely portable, even on platforms that don't support them natively.
On average you should see a size reduction of roughly 30% with a well-balanced JSON object (lots of mixed types). If you want to know exactly how certain structures compress (or don't), you can check the Size Requirements section to get an idea.
On the bright side, regardless of compression, the data will be written in a more optimized format and be faster to work with.
I checked in the core Input/OutputStream implementations for reading/writing the format to GitHub today. I'll check in the general reflection-based object mapping later this week.
You can just look at those two classes to see how to read and write the format; I think the core logic is something like 20 lines of code. The classes are longer because of abstractions in the methods and some structure around checking the marker bytes to make sure the data file is in a valid format, things like that.
If you have really specific questions, like the endianness of the spec (big-endian) or the numeric format for doubles (IEEE 754), all of that is covered in the spec doc, or just ask me.
Hope that helps!
Yes: the Smile data format (see the Wikipedia entry). It has a public Java implementation, and a C version is in the works at GitHub (libsmile). It has the benefit of being reliably more compact than JSON while keeping a 100% compatible logical data model, so it is easy to convert back and forth with textual JSON.
For performance, you can see the jvm-serializers benchmark, where Smile competes well with other binary formats (Thrift, Avro, protobuf); size-wise it is not the most compact (since it does retain field names), but it does much better with data streams where names are repeated.
It is used by projects like Elasticsearch and (optionally) Solr, and Protostuff-rpc supports it, although it is not as widely used as, say, Thrift or protobuf.
EDIT (Dec 2011) -- There are now also libsmile bindings for PHP, Ruby and Python, so language support is improving. In addition, there are measurements on data size; although for single-record data alternatives (Avro, protobuf) are more compact, for data streams Smile is often more compact due to its key and String value back-reference option.
gzipping JSON data is going to get you good compression ratios with very little effort because of its universal support. Also, if you're in a browser environment, you may end up paying a greater byte cost in the size of the dependency from a new library than you would in actual payload savings.
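As a rough illustration (Go's standard library here, but every mainstream platform has an equivalent), gzipping a payload takes only a few lines; note that for very small payloads the gzip header overhead can actually make the output larger than the input:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

func main() {
	payload := []byte(`{"name":"example","tags":["red","blue","red","blue","red"]}`)

	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(payload)
	zw.Close() // Close flushes the remaining compressed bytes

	fmt.Printf("raw: %d bytes, gzipped: %d bytes\n", len(payload), buf.Len())
}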
If your data has additional constraints (such as lots of redundant field values), you may be able to optimize by looking at a different serialization protocol rather than sticking to JSON. Example: a column-based serialization such as Avro's upcoming columnar store may get you better ratios (for on-disk storage). If your payloads contain lots of constant values (such as columns that represent enums), a dictionary compression approach may be useful too.
Another alternative that should be considered these days is CBOR (RFC 7049), which has an explicitly JSON-compatible model with a lot of flexibility. It is both stable and meets your open-standard qualification, and has obviously had a lot of thought put into it.
Have you tried BJSON?
Try using js-inflate to make and unmake blobs:
https://github.com/augustl/js-inflate
It works perfectly and I use it a lot.
You might also want to take a look at a library I wrote. It's called minijson, and it was designed for this very purpose.
It's Python:
https://github.com/Dronehub/minijson
Have you tried Apache Avro?
https://avro.apache.org/
RFC 7159 says
JavaScript Object Notation (JSON) is a text format for the serialization of structured data.
But Go treats JSON as []byte
func Marshal(v interface{}) ([]byte, error)
func Unmarshal(data []byte, v interface{}) error
Why don't these functions take and return a string?
I could not find any explanation here
https://golang.org/pkg/encoding/json/
https://blog.golang.org/json-and-go
Go does not go by "strings are for text, byte types are for other stuff" like some other languages (e.g. Python 3) do. "In Go, a string is in effect a read-only slice of bytes." The string type has a few behaviors attached that are handy for dealing with UTF-8 text, but it'll hold whatever bytes you put in it. Text-handling stuff in the standard library is often written to work with []bytes too, e.g. package bytes mirrors package strings and regexp deals in either.
Given that there's no rule about text/binary semantically belonging in one type or the other, the choice to use []byte was probably made for practical reasons. Since strings are read-only slices of bytes, almost all operations changing strings have to copy bytes to a new string instead of modifying the existing one. (String slicing is a key exception; it just makes a new string header that can point into the old string's bytes.)
Copying string contents for each operation leads to a quadratic slowdown as the string length and number of copies both grow with input size. On top of the direct cost of the copies, allocating the space for them makes garbage collection happen more often. For those reasons, almost everything that builds up content via a lot of small operations in Go uses a []byte internally. That includes Go's JSON-marshalling code, and the strings.Builder type added in Go 1.10.
(For similar reasons, Java and C# offer string-builder types as well and modern JavaScript VMs have clever tricks to defer copying bytes until after a long series of concat operations, such as V8's cons strings and SpiderMonkey's ropes.)
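A toy sketch of the difference (the function names here are made up):

package main

import "strings"

// concatNaive copies the whole accumulated string on every iteration,
// which is quadratic in the total output length.
func concatNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p // allocates a new string and copies len(s)+len(p) bytes
	}
	return s
}

// concatBuilder appends into a growable internal []byte and converts
// to a string once at the end.
func concatBuilder(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"a", "b", "c"}
	_ = concatNaive(parts)
	_ = concatBuilder(parts)
}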
Because []bytes are read-write and strings are read-only, converting one to the other also has to copy bytes. If json.Marshal returned a string, that would require making another copy of the content (and the associated load on the GC). Also, if you're ultimately going to do I/O with this, Write() takes a byte slice, so for that you'd have to convert back, creating another copy. (To slightly mitigate that, some I/O types, including *os.File, support WriteString() as well. But not all do!)
So it makes more sense for json.Marshal to return the []byte it built up internally; you can of course call string(bytes) on the result if you need a string and the copying isn't a problem.
A bit out of the original question's scope, but often the best performing option is just to stream the output directly to an io.Writer using a json.Encoder. You never have to allocate the whole chunk of output at once, and it can make your code simpler as well since there's no temp variable and you can handle marshalling and I/O errors in one place.
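A minimal sketch of both approaches (os.Stdout standing in for any io.Writer):

package main

import (
	"encoding/json"
	"log"
	"os"
)

type Point struct {
	X, Y int
}

func main() {
	p := Point{1, 2}

	// Option 1: build a []byte, converting to a string only if needed.
	b, err := json.Marshal(p)
	if err != nil {
		log.Fatal(err)
	}
	s := string(b) // copies the bytes into a new read-only string
	_ = s

	// Option 2: stream straight to the writer; marshalling and I/O
	// errors are handled in one place, with no temp variable.
	if err := json.NewEncoder(os.Stdout).Encode(p); err != nil {
		log.Fatal(err)
	}
}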
Trying to scrape a webpage, I ran into the necessity of working with ASP.NET's __VIEWSTATE variables. So, ever the optimist, I decided to read up on those variables and their formats. Even though it is classified as open source by Microsoft, I couldn't find any formal definition:
Everybody agrees the first step is to decode the string using a Base64 decoder. Great - that works...
Next - and this is where the confusion sets in:
Roughly 3/4 of the decoders seem to use binary values (characters whose values indicate the type of the field that follows). Here's an example of such a specification. This format also seems to expect a 'signature' of 0xFF 0x01 as the first two bytes.
The rest of the articles (such as this one) describe a format where the fields are separated (or marked) by t< ... >, p< ... >, etc. (this seems to be the case for the page I'm interested in).
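To tell the two cases apart in my own experiments, I decode and check the leading bytes; a minimal sketch in Go (the sample value is made up and contains just the signature bytes):

package main

import (
	"encoding/base64"
	"fmt"
	"log"
)

func main() {
	// Made-up minimal value: just 0xFF 0x01, base64-encoded.
	viewState := "/wE="

	raw, err := base64.StdEncoding.DecodeString(viewState)
	if err != nil {
		log.Fatal(err)
	}
	if len(raw) >= 2 && raw[0] == 0xFF && raw[1] == 0x01 {
		fmt.Println("binary format (0xFF 0x01 signature)")
	} else {
		fmt.Println("textual t<...>/p<...> style format")
	}
}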
Even after looking at over a hundred pages, I didn't find any mention of the existence of two formats.
My questions are: Are there two different formats of __VIEWSTATE variables in use, or am I missing something basic? Is there any formal description of the __VIEWSTATE contents somewhere?
The view state is serialized and deserialized by the System.Web.UI.LosFormatter class—the LOS stands for limited object serialization—and is designed to efficiently serialize certain types of objects into a base-64 encoded string. The LosFormatter can serialize any type of object that can be serialized by the BinaryFormatter class, but is built to efficiently serialize objects of the following types:
Strings
Integers
Booleans
Arrays
ArrayLists
Hashtables
Pairs
Triplets
Everything you need to know about ViewState: Understanding View State
I am using the envelope pattern, and my canonical model part is in XML format. I usually return the model in full or in a summary version. Retrieval of documents is pretty quick, but in my REST calls, where I need to return JSON to the browser, the json:transform-to-json version takes double the time of the call that just returns the XML.
Is it a reasonable strategy to also keep the canonical model in JSON format in the envelope, or perhaps to keep rendered JSON in full and summary forms in other documents outside of the envelope, which don't get searched but are mainly used when returning results? That way I wouldn't have to incur the cost of transforming the canonical model to JSON on every call.
Are there any other ways that this has been done?
Conversion from XML to JSON should be relatively light, but the mere fact that it has to do something will take overhead. Doing that work upfront will definitely save time. You can put both formats in the same envelope (though JSON will have to be stored as a string then), or in a different document as you suggest. Alternatively, you could store it in document-properties. Unfortunately, that only takes XML as well, so you would be storing your JSON as a string in there too.
Alternatively, have you profiled the transform to see if there is a particular reason why it slows down so much? Using XSLT versus XQuery for the transform could make a difference too.
HTH!
json:transform-to-json has 3 algorithms optimized for different purposes, which perform with different tradeoffs of flexibility, fidelity and performance:
"basic" (default) - useful only to reverse json:transform-from-json()
"full" - preserves as much information fidelity as possible, in exchange for a non-'pretty' format in many cases
"custom" - is ... custom ... designed for when the JSON format is fixed, or when you want control over the JSON output at the expense of handling only a subset of XML accurately
Basic and full are the most efficient. However, all variants are fairly involved and require completely traversing the XML node tree and creating a JSON object tree bottom-up. In ML version 8 this is then translated into the native JSON node structure. In a REST call it would then be serialized as text.
Compared to a direct return of an XML document via fn:doc("file.xml"), there are at least two orders of magnitude more operations involved in the transform case.
For small documents in a REST call, that is still a small fraction of the total request time, especially if the REST call performs a complex operation itself and then returns a small result. Your use case seems to be the opposite: returning an XML document directly bypasses almost all of the XQuery processing and is sent straight from internal storage to the output or assigned to a variable.
If that is an important use case to optimize, especially if the documents can be large, then saving the JSON as text or binary will be much faster -- at the expense of more storage used. If this is only a variant representation of the XML, try storing the JSON text as binary, as it will not incur any indexing overhead.
Otherwise, if you need to query over the JSON: in ML7 storing as text gives you simple word queries, and in ML8 storing as native JSON gives you structured queries -- both with efficient text serialization.
Can anyone explain the differences between scan and binary scan, and between format and binary format? I am getting confused by the binary commands.
To understand the difference between command sets manipulating binary and string data you have to understand the distinction between these two kinds of data.
In Tcl, as in many (most?) high-level languages, strings are rather abstract — that is, they are described in pretty high-level terms. Particularly in Tcl, strings are defined to have the following properties:
They contain characters from the Unicode repertoire.
The Tcl runtime provides the set of standard commands to operate on strings — such as indexing, searching, appending to, extracting a substring etc.
Note that many things are left out from this definition:
The encoding in which these Unicode characters are stored.
How exactly they are stored (NUL-terminated arrays? linked lists of unsigned longs? something else?).
(To put it into a more interesting perspective, Tcl is able to transparently change the underlying representations of strings it manages — between UTF-8 and UTF-16 encoded sequences. But here we're talking about the reference Tcl implementation, and other implementations (such as Jacl for instance) are free to do something else completely.)
The same approach is used to manipulate all the other kinds of data in the Tcl interpreter. Say, integer numbers are stored using native platform "integers" (roughly "as in C") but they are transparently upgraded into arbitrary sized integers if an arithmetic operation is about to overflow the platform-sized result.
So long as you don't leave the comfortable world of the Tcl interpreter, this is all you should know about the data types it manages. But now there's the outside world. In it, abstract concepts which are Tcl strings do not exist. Say, if you need to communicate to some other program over a network socket or by means of using a file or whatever other kind of media, you have to get down to the level of exact layouts of raw bytes which are described by "wire protocols" and file formats or whatever applies to your case. This is where "binaries" come into play: they allow you to precisely specify how the data is laid out so that it's ready to be transferred to the outside world or be consumed from it — binary format makes these "binaries" and binary scan reads them.
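For instance, packing and unpacking a made-up two-field wire header (a big-endian 16-bit id followed by a big-endian 32-bit length) looks like this:

# Pack: S = big-endian 16-bit, I = big-endian 32-bit (this layout is invented)
set header [binary format SI 42 1024]
string length $header          ;# 6 -- six raw bytes, not formatted text

# Unpack: the same field specifiers drive the decoding
binary scan $header SI id len
puts "$id $len"                ;# 42 1024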
Note that certain Tcl commands for working with the outside world are "smart by default" — for instance, the open command, which opens files, by default assumes they are textual and encoded in the default system encoding (which is deduced, broadly speaking, from the environment). You can then use the chan configure command (or fconfigure in older versions of Tcl) to either change this encoding or completely inhibit conversions by specifying that the channel is in "binary mode". The same applies to EOL conversions.
Note also that there are specialized packages for Tcl that effectively hide the complexities of working with a particular wire/file format. To present one example, the tdom package works with XML; when you manipulate XML using this package, you're not concerned with how exactly XML must be represented when, say, saved to a file — you just work with tdom's objects, native Tcl strings etc.
The docs are pretty good and contain examples:
scan: http://www.tcl.tk/man/tcl8.6/TclCmd/scan.htm
format: http://www.tcl.tk/man/tcl8.6/TclCmd/format.htm
binary scan: http://www.tcl.tk/man/tcl8.6/TclCmd/binary.htm#M42
binary format: http://www.tcl.tk/man/tcl8.6/TclCmd/binary.htm#M16
Maybe you could ask a more specific question?
The format command assembles strings of characters, while the binary format command assembles strings of bytes. The scan and binary scan commands do the reverse, extracting information from character strings and byte strings respectively.
Note that Tcl happens to map byte strings neatly onto character strings where the characters are in the range \u0000–\u00FF, and there are other operations for getting information into and out of binary strings that are sometimes relevant. Most notably, encoding convertto and encoding convertfrom: encoding convertto formats a string as a sequence of bytes that represent that string in a given encoding (an operation which can lose information), and encoding convertfrom goes in the opposite direction.
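For example (the byte counts in the comments assume UTF-8):

set s "héllo"                        ;# a 5-character string
set bytes [encoding convertto utf-8 $s]
string length $s                     ;# 5 characters
string length $bytes                 ;# 6 -- "é" became two bytes
set back [encoding convertfrom utf-8 $bytes]
string equal $s $back                ;# 1 -- round-trip is lossless here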
So what encoding are Tcl's strings really in? Well, none really. Or many. The logical level works with character sequences exclusively, and the implementation will actually move things back and forth (mostly between a variant of UTF-8 and UCS-2, though with optimisations for handling byte strings via arrays of unsigned char) as necessary. While this is not always perfectly efficient, most code never notices what's going on due to the type-caching used.
If you have Tcl 8.6, you can peek behind the covers to observe the types with an unsupported command:
# Output is human-readable; experiment to see what it says for you
puts [tcl::unsupported::representation $MyString]
Don't use this to base functional decisions on; Tcl is very happy to mutate types out from under your feet. But it can help when finding out why your code is unexpectedly slow. (Note also that types attach to values, and not to variables.)
I have to generate codes with custom fields: id of the field + name of the field + values of the field.
How much data can I encode inside a QR code? I need to know how many fields/values I can insert.
Should I use XML or JSON or CSV? What is most generic and efficient?
XML / JSON will not qualify for a QR code's alphanumeric mode, since they include lower-case letters. You'll have to use byte mode. The max is 2,953 bytes. But the practical limit is far less -- perhaps a few hundred characters.
It is far better to encode a hyperlink to data if you can.
As Terence says, no reader will do anything with XML/JSON except show it. You need a custom reader anyway to do something useful with that data. (Which suggests this is not a good use case for QR codes.) But if you're making your own reader, you can use gzip compression to make the payload much smaller. Your reader would know to unzip it.
You might get away with something workable but this is not a good approach in general.
The maximum number of alphanumeric characters you can have is 4,296, although this will require the lowest level of error correction and will be very hard to scan.
JSON is generally more efficient at data storage than XML.
However, you will need to write your own app to scan the code - I don't know of any which will process raw JSON or XML. All the scanners will show you the text, though.