Can I get more explanations for BSON?

I am trying to understand BSON via http://bsonspec.org/#/specification, but some questions still remain.
Let's take an example from the website above:
{"hello": "world"} → "\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"
Question 1
In the above example, the double quotes are not actually part of the encoded byte result, right?
Question 2
I understand that the first 4 bytes \x16\x00\x00\x00 are the size of the whole BSON doc.
And it is in little-endian format. But why? Why not big-endian?
Question 3
How come the size of the example doc is \x16, i.e. 22?
Question 4
Normally, if I want to encode the doc myself, how do I calculate its size? My main trouble is how to determine the size of a UTF-8 string.
Let's take another example:
{"BSON": ["awesome", 5.05, 1986]}
→
"\x31\x00\x00\x00\x04BSON\x00\x26\x00\x00\x00\x020\x00\x08\x00\x00
\x00awesome\x00\x011\x00\x33\x33\x33\x33\x33\x33\x14\x40\x102\x00\xc2\x07\x00\x00
\x00\x00"
Question 5
In this example, there is an array. According to the specification, an array is actually a list of {key, value} pairs, where the keys are 0, 1, etc. My question is: the 0, 1 here are strings too, right?

Question 1
In the above example, the double quotes are not actually part of the encoded byte result, right?
Correct. The quotes are not part of the strings; they are only JSON syntax marking string literals.
Question 2
And it is in little-endian format. But why? Why not big-endian?
The choice of endianness is largely a matter of preference. One advantage of little-endian is that the most commonly used platforms are little-endian, and thus don't need to reverse the bytes.
Question 3
How come the size of the example doc is \x16, i.e. 22?
There are 22 bytes in total, including the 4-byte length prefix itself.
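To make the count concrete, here is the example document written out byte by byte in Python (a quick sketch; the byte values come straight from the example at the top):

# {"hello": "world"} piece by piece: 4 + 1 + 6 + 4 + 6 + 1 = 22 bytes
doc = (
    b"\x16\x00\x00\x00"   # int32 total size = 22 (counts these 4 bytes too)
    b"\x02"               # element type 0x02 = UTF-8 string
    b"hello\x00"          # element name as cstring: 5 bytes + NUL
    b"\x06\x00\x00\x00"   # int32 string length = 6 ("world" + trailing NUL)
    b"world\x00"          # the string bytes + NUL terminator
    b"\x00"               # NUL that terminates the document
)
assert len(doc) == 0x16 == 22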
Question 4
Normally, if I want to encode the doc myself, how do I calculate its size? My main trouble is how to determine the size of a UTF-8 string.
First write out the document, then go back and fill in the length. As for strings: a BSON string element stores a 4-byte length prefix, the UTF-8 bytes, and a NUL terminator, and the prefix counts the encoded UTF-8 bytes plus that trailing NUL, so you measure the byte length of the encoded string rather than its character count.
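Here is a minimal sketch of that approach in Python, handling only string elements (the names encode_string_element and encode_document are made up for illustration):

import struct

def encode_string_element(name: str, value: str) -> bytes:
    """Encode one BSON string element (type 0x02)."""
    name_bytes = name.encode("utf-8") + b"\x00"    # cstring: bytes + NUL
    value_bytes = value.encode("utf-8") + b"\x00"  # string bytes + NUL
    # The int32 prefix counts the UTF-8 bytes *plus* the trailing NUL.
    return b"\x02" + name_bytes + struct.pack("<i", len(value_bytes)) + value_bytes

def encode_document(pairs):
    body = b"".join(encode_string_element(k, v) for k, v in pairs) + b"\x00"
    # Backfill the total size; the int32 prefix includes its own 4 bytes.
    return struct.pack("<i", len(body) + 4) + body

# Reproduces the hello/world example exactly:
assert encode_document([("hello", "world")]) == \
    b"\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"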
Question 5
In this example, there is an array. According to the specification, an array is actually a list of {key, value} pairs, where the keys are 0, 1, etc. My question is: the 0, 1 here are strings too, right?
Yes. Zero-terminated strings without a length prefix, to be exact (called cstring in the spec), just like the element names of an embedded document.
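You can see those cstring keys directly in the bytes of the array example, annotated in the same Python style as above (the keys are the single characters '0', '1', '2', each followed by a NUL):

# {"BSON": ["awesome", 5.05, 1986]} piece by piece
doc = (
    b"\x31\x00\x00\x00"                  # int32 total size = 49
    b"\x04" b"BSON\x00"                  # type 0x04 = array, name "BSON"
    b"\x26\x00\x00\x00"                  # int32 size of the embedded doc = 38
    b"\x02" b"0\x00"                     #   string element, key "0"
    b"\x08\x00\x00\x00" b"awesome\x00"   #   length 8: "awesome" + NUL
    b"\x01" b"1\x00"                     #   double element, key "1"
    b"\x33\x33\x33\x33\x33\x33\x14\x40"  #   5.05 as little-endian IEEE 754
    b"\x10" b"2\x00"                     #   int32 element, key "2"
    b"\xc2\x07\x00\x00"                  #   1986, little-endian
    b"\x00"                              #   NUL ending the embedded doc
    b"\x00"                              # NUL ending the outer doc
)
assert len(doc) == 0x31 == 49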

Related

When JSON is sent over the network, how are numbers represented (as binary or text)?

This might be a trivial question... Or might not be. When I serialize an object to JSON, how are numbers represented?
Specifically, I need to know how efficiently they are encoded to binary. There are 2 ways:
Transform the number to its decimal string representation and then encode that string to binary.
Or encode the number directly to binary.
Which is the case?
That is a big difference. Let's say the serialized object contains the number 12345678. Encoded the first way it will take 8 B to transfer; encoded the second way, only 4 B. When it comes to lots of big numbers (my case), in the first case I would be better off using base64 as a pre-processing step before serialization.
I can imagine that this might be dependent on serializer (though I really hope it is not). In that case, I am using Firebase Realtime database SDK.
JSON is a textual notation. So the number 12345678 is sent as those eight characters, 1, 2, 3, etc. Depending on your text encoding, that's probably eight bytes (e.g., UTF-8 or Windows-1252; but if you were using UTF-16, for instance, it would be 16 bytes).
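You can verify this with any JSON library; a quick Python check (the key name "n" is only for illustration):

import json
import struct

payload = json.dumps({"n": 12345678})
print(payload)                           # {"n": 12345678}
print(len(payload.encode("utf-8")))      # 15 bytes; the digits alone take 8
# versus a fixed-width binary encoding of the same value:
print(len(struct.pack("<i", 12345678)))  # 4 bytes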
There have been various "binary JSON" proposals over the years, but I don't think any of them really caught on outside of specific applications (for instance, BSON in MongoDB).

json boolean vs integer - which takes up less space?

When sending a value in JSON over the wire, is it better to use a boolean or an integer to use up less space?
e.g:
{
foo: false
}
Or:
{
foo: 0
}
Would using a number use less space, considering it's just a single digit, compared to the 4 or 5 characters of a boolean value (true/false)?
Also is there a speed difference between the two approaches if you convert them from JSON to object format?
Firstly, this is micro-optimisation, and very unlikely to be important. If you are transporting thousands or millions of such values, it might become significant; but in that case, you probably want something much more efficient than JSON anyway (a plain CSV would be better in many cases, but ideally you'd use some packed binary format).
Secondly, JSON is a way of representing data in a string; so storing or sending JSON means you are storing or sending strings. Measuring the size of the data is therefore trivial: how long is the string? The string 0 has one character; the string false has five characters.
Thirdly, if you're optimising for space, you'd remove all insignificant whitespace, so your examples should be {"foo":false} (13 characters) and {"foo":0} (9 characters). Note that you can't, as you have in your example, skip the quote marks around foo - that is not valid JSON.
Fourthly, how much memory or other resources the structure will take up when you convert it from JSON into an object depends on what language you're using, what implementation of that language, and any number of other factors, so is completely unanswerable (and, again, a micro-optimisation that is very unlikely to be important).
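A quick Python check of the second and third points (minified via separators so no insignificant whitespace is counted):

import json

print(len(json.dumps({"foo": False}, separators=(",", ":"))))  # 13: {"foo":false}
print(len(json.dumps({"foo": 0}, separators=(",", ":"))))      # 9:  {"foo":0}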
I think the integer is a better solution because, besides using less space (and consequently being potentially faster to parse), it is also more future-proof. Someone can easily convert it into a three (or more) state variable if needed by just assigning other values like -1, 2, 3..., while the conversion from a boolean would be less straightforward.

Tiff versus BigTiff

Please let me know if there is another Stack Exchange community this question would be better suited for.
I am trying to understand the basic differences between Tiff and BigTiff. I have looked on various sites and the only difference that is mentioned is that BigTiff uses 64-bit offsets while Tiff uses 32-bit offsets. That being said, you would need to know which of the two types you are reading. How is this done? According to https://www.leadtools.com/help/leadtools/v19/main/api/tifffmt.html, this is done by reading a file flag. However, the flag they are referring to appears to be unique to their own reader as I cannot find a corresponding data field in the specifications as shown by http://www.fileformat.info/format/tiff/egff.htm. What am I missing? Does BigTiff use a different file header than Tiff?
Everything you need to know is described in the BigTIFF link posted by @cgohlke. This is just to provide an answer to your question:
Yes, it uses a different file header.
Normal TIFF uses the following header:
2 byte byte order mark, "II" for "Intel"/little endian, or "MM" for "Motorola"/big endian.
The (version) number 42* as a 16 bit value, in the endianness given.
Unsigned 32 bit offset to IFD0
BigTIFF uses a slightly different header:
2 byte byte order mark as above
The (version) number 43 as a 16 bit value, in the endianness given.
Byte size of offset as a 16 bit value, always 8 for BigTIFF
2 byte padding, always 0 for BigTIFF
Unsigned 64 bit offset to IFD0
*) The value 42 was chosen for its "deep philosophical significance". Or according to the official specification, "[a]n arbitrary but carefully chosen number"...
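In code, telling the two variants apart comes down to that version number after the byte-order mark. A minimal Python sketch of just the header logic described above (read_tiff_header is a made-up name, and error handling is minimal):

import struct

def read_tiff_header(path):
    """Return (version, endianness prefix, offset to IFD0)."""
    with open(path, "rb") as f:
        bom = f.read(2)
        if bom == b"II":
            endian = "<"                          # "Intel" = little endian
        elif bom == b"MM":
            endian = ">"                          # "Motorola" = big endian
        else:
            raise ValueError("not a TIFF file")
        (version,) = struct.unpack(endian + "H", f.read(2))
        if version == 42:                         # classic TIFF: 32-bit offset
            (ifd0,) = struct.unpack(endian + "I", f.read(4))
        elif version == 43:                       # BigTIFF: offset size (8),
            size, pad = struct.unpack(endian + "HH", f.read(4))
            if size != 8 or pad != 0:             # 2-byte padding (0),
                raise ValueError("unexpected BigTIFF header")
            (ifd0,) = struct.unpack(endian + "Q", f.read(8))  # 64-bit offset
        else:
            raise ValueError("unknown version %d" % version)
        return version, endian, ifd0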

Bulk export of binary waveform data from oscilloscope to data points (csv preferred)

I'm working with some binary waveform files from various early-to-mid-90's HP scopes. I am trying to do a bulk conversion (we have over 5000) of the files to CSVs and then upload them into a database. I've tried hexdump, xxd, od, strings, etc., and none of them seem to work. I did hunt down a programmer's manual, but it's not making a whole lot of sense.
The files have a preamble line as ASCII text, but the data points are in binary, and for some reason nothing I try can decode them. The preamble gives the data necessary to interpret the binary values and calculate the correct values. It also states that the data is in WORD format.
:WAV:PRE 2,1,32768,1,+4.000000E-08,-4.9722700001108E-06,0,+2.460630E-04,+2.500000E+00,16384;:WAV:DATA #800065536^W�^W�^W�^
I'm pretty confused.
Have a look at
http://www.naic.edu/~phil/hardware/oscilloscopes/9000A_Programmer_Reference.pdf
specifically page 1-21. After ":WAV:DATA", I think the rest of the chunk above will have 65536 8-bit data bytes (the start of which is represented above by �). The ^W is probably a delimiter, so you would have to parse that out. Just a thought.
UPDATE: I'm new to oscilloscope data collection and am trying to figure the whole thing out from scratch. So, on further digging, it looks like the data you have provided shows this:
PREamble:
- WORD format (16-bit signed integers split into 2 8-bit bytes)
- If there is a WAV:BYT section, that would specify byte order for each pair
- RAW data
- 32768 data points
- COUNT = 1 (I'm not clear on the meaning of this)
- Next 3 should be X increment, origin, reference
- Next 3 should be Y increment, origin, reference, although the manual that I pointed you at above has many more fields than just these, so you might want to consult your specific scope manual.
DATA:
- On closer examination, I don't think the ^W is a delimiter; I think it is the first byte of a pair (binary 00010111, i.e. 0x17). The � character is apparently a standard "I don't know how to represent this character" web representation. You would need to look at that character as 8 bits also.
- 65536 byte pairs of data
I'm not finding a utility that will do this for you. I think you're going to have to write or acquire some code (Perl, C, Java, Python, VB, etc.) to get this done.
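To get you started, here is a rough Python sketch of such a converter (waveform_to_csv is a made-up name). It assumes the layout deduced above: an ASCII preamble with the field order shown in the question, a '#'-prefixed IEEE 488.2 block header (one digit giving the digit count, then the byte count), big-endian 16-bit words, and the usual HP scaling formulas volts = (word - yreference) * yincrement + yorigin and time = (i - xreference) * xincrement + xorigin. Verify all of that against your scope's programming manual before trusting the output:

import csv
import struct

def waveform_to_csv(path, out_path, big_endian=True):
    raw = open(path, "rb").read()
    head, _, rest = raw.partition(b":WAV:DATA ")
    # Preamble fields: format,type,points,count,xinc,xorg,xref,yinc,yorg,yref
    pre = head.split(b":WAV:PRE ")[1].split(b";")[0].decode().split(",")
    x_inc, x_org, x_ref = float(pre[4]), float(pre[5]), float(pre[6])
    y_inc, y_org, y_ref = float(pre[7]), float(pre[8]), float(pre[9])
    # IEEE 488.2 block: '#', digit N, then N digits giving the byte count
    assert rest[0:1] == b"#"
    ndigits = int(rest[1:2])
    nbytes = int(rest[2:2 + ndigits])
    data = rest[2 + ndigits:2 + ndigits + nbytes]
    fmt = (">" if big_endian else "<") + "%dh" % (nbytes // 2)
    words = struct.unpack(fmt, data)          # signed 16-bit WORDs
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["time_s", "volts"])
        for i, v in enumerate(words):
            w.writerow([(i - x_ref) * x_inc + x_org,
                        (v - y_ref) * y_inc + y_org])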

How can you reverse engineer a binary thrift file?

I've been asked to process some files serialized as binary (not text/JSON unfortunately) Thrift objects, but I don't have access to the program or programmer that created the files, so I have no idea of their structure, field order, etc. Is there a way using the Thrift libraries to open a binary file and analyze it, getting a list of the field types, values, nesting, etc.?
Unfortunately, it appears that Thrift's binary protocol does not do very much tagging of data at all; to decode it, you are apparently assumed to have the .thrift file in hand so you know that, say, the next 4 bytes are supposed to be an integer and aren't actually the first half of a float. So it appears you are stuck with, basically, looking at the files in a hex editor (or equivalent) and trying to deduce fields from the exact patterns you see.
There are a very few helpful bits:
- Each file begins with a version, protocol identifier string, and sequence number.
- Maps begin with 6 bytes that identify the key and value types (the first two bytes, as integer codes) plus the number of elements as a 4-byte integer. The type codes appear to be standard (the canonical location of their definitions seems to be TProtocol.h in the Thrift sources; for instance, a boolean value is specified by type code 2, a UTF-8 string by type code 16, and so on).
- Strings are prefixed by a 4-byte integer length field, and lists are prefixed by the element type (1 byte) and a 4-byte length.
- It looks like all integer fields are saved big-endian, and floating points are saved in IEEE format (which should make doubles relatively easy to find, at least).
The TBinaryProtocol* files in Thrift have a few more helpful details; on the plus side, there are a number of different implementations so you can read the ones implemented in the language you are most comfortable with.
Sorry, I know this probably isn't that helpful, but it really does appear this is all the information the Thrift binary format provides; clearly the binary format was designed with the intent that you would always know the exact protocol spec already, and the goal was to minimize wire space rather than make it at all easy to decode blindly.
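That said, the layout above is enough to attempt a blind dump. Here is a Python sketch that assumes the payload is a bare struct serialized with TBinaryProtocol (no message header) and that the standard type codes apply; read_value is a made-up name, and field ids come back as numbers rather than names, since the names only exist in the .thrift file:

import struct

# Thrift TBinaryProtocol type codes (see TProtocol.h)
STOP, BOOL, BYTE, DOUBLE, I16, I32, I64 = 0, 2, 3, 4, 6, 8, 10
STRING, STRUCT, MAP, SET, LIST = 11, 12, 13, 14, 15

def read_value(buf, pos, ttype):
    """Decode one value of the given type; return (value, new position)."""
    if ttype == BOOL:   return buf[pos] != 0, pos + 1
    if ttype == BYTE:   return buf[pos], pos + 1
    if ttype == DOUBLE: return struct.unpack_from(">d", buf, pos)[0], pos + 8
    if ttype == I16:    return struct.unpack_from(">h", buf, pos)[0], pos + 2
    if ttype == I32:    return struct.unpack_from(">i", buf, pos)[0], pos + 4
    if ttype == I64:    return struct.unpack_from(">q", buf, pos)[0], pos + 8
    if ttype == STRING:                       # 4-byte length prefix, then bytes
        (n,) = struct.unpack_from(">i", buf, pos)
        return buf[pos + 4:pos + 4 + n], pos + 4 + n
    if ttype == STRUCT:                       # fields until a STOP byte
        fields = {}
        while True:
            ftype = buf[pos]; pos += 1
            if ftype == STOP:
                return fields, pos
            (fid,) = struct.unpack_from(">h", buf, pos); pos += 2
            fields[fid], pos = read_value(buf, pos, ftype)
    if ttype in (LIST, SET):                  # element type, then 4-byte count
        etype = buf[pos]
        (n,) = struct.unpack_from(">i", buf, pos + 1); pos += 5
        items = []
        for _ in range(n):
            v, pos = read_value(buf, pos, etype)
            items.append(v)
        return items, pos
    if ttype == MAP:                          # key/value types, 4-byte count
        ktype, vtype = buf[pos], buf[pos + 1]
        (n,) = struct.unpack_from(">i", buf, pos + 2); pos += 6
        m = {}
        for _ in range(n):
            k, pos = read_value(buf, pos, ktype)
            v, pos = read_value(buf, pos, vtype)
            m[k] = v
        return m, pos
    raise ValueError("unknown type code %d at offset %d" % (ttype, pos))

# e.g.: fields, _ = read_value(open("data.bin", "rb").read(), 0, STRUCT)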