Confused about endianness: bits or bytes? - terminology

I extracted this from a tutorial:
Little-Endian order is the one we will be using in this document, and unless stated specifically you should assume that Little-Endian order is used in any file. The alternative is Big-Endian ordering. So let's see an example. Take the following stream of 8 bits: 10001110. If you have been following the document so far, you would quickly calculate the value of this 8-bit number as 1x2^7 + 0x2^6 + … + 1x2^1 + 0x2^0 = 142. This is an example of Little-Endian ordering. However, in Big-Endian ordering we need to read the number in the opposite direction: 1x2^0 + 0x2^1 + … + 1x2^6 + 0x2^7 = 113.
Is this correct?
I used to think that endianness has to do with the order in which the BYTES (not the bits) are read.

Yes, in the context of memory/storage, endianness indeed refers to byte ordering (typically). What would it mean to say that e.g. the least-significant bit "comes first"?
Bit endianness is relevant in some situations, for instance when sending data over a serial bus.

You are correct - that quote you have there is rubbish, IMHO.

It wouldn't make sense to reorder bits, and it would be pretty confusing to boot. CPUs don't read single bits; they read bytes, or combinations of bytes, at one time, so that's the ordering that matters.
When they store a number made up of multiple bytes, they can either store it from left to right, making the high-order byte lowest in memory, or right to left, with the low-order byte lowest in memory.
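To make that concrete, here is a small Python sketch (my addition, not part of the original answer) showing how the same 32-bit value is laid out under each convention:

value = 0x12345678
little = value.to_bytes(4, "little")   # low-order byte stored first
big = value.to_bytes(4, "big")         # high-order byte stored first
print(little.hex(" "))                 # 78 56 34 12
print(big.hex(" "))                    # 12 34 56 78

Note that the bits inside each byte are untouched; only the order of the bytes differs.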


Tiff versus BigTiff

Please let me know if there is another Stack Exchange community this question would be better suited for.
I am trying to understand the basic differences between Tiff and BigTiff. I have looked on various sites and the only difference that is mentioned is that BigTiff uses 64-bit offsets while Tiff uses 32-bit offsets. That being said, you would need to know which of the two types you are reading. How is this done? According to https://www.leadtools.com/help/leadtools/v19/main/api/tifffmt.html, this is done by reading a file flag. However, the flag they are referring to appears to be unique to their own reader as I cannot find a corresponding data field in the specifications as shown by http://www.fileformat.info/format/tiff/egff.htm. What am I missing? Does BigTiff use a different file header than Tiff?
Everything you need to know is described in the BigTIFF link posted by @cgohlke. This is just to provide an answer to your question:
Yes, it uses a different file header.
Normal TIFF uses the following header:
2 byte byte order mark, "II" for "Intel"/little endian, or "MM" for "Motorola"/big endian.
The (version) number 42* as a 16 bit value, in the endianness given.
Unsigned 32 bit offset to IFD0
BigTIFF uses a slightly different header:
2 byte byte order mark as above
The (version) number 43 as a 16 bit value, in the endianness given.
Byte size of offset as a 16 bit value, always 8 for BigTIFF
2 byte padding, always 0 for BigTIFF
Unsigned 64 bit offset to IFD0
*) The value 42 was chosen for its "deep philosophical significance". Or according to the official specification, "[a]n arbitrary but carefully chosen number"...
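Putting the two layouts together, reading the byte order mark first and then the version number is enough to tell the formats apart. Here is a minimal Python sketch of such a header probe (read_tiff_header is a made-up helper, not part of any TIFF library):

import struct

def read_tiff_header(path):
    with open(path, "rb") as f:
        bom = f.read(2)
        if bom == b"II":
            endian = "<"       # little endian ("Intel")
        elif bom == b"MM":
            endian = ">"       # big endian ("Motorola")
        else:
            raise ValueError("not a TIFF file")
        version, = struct.unpack(endian + "H", f.read(2))
        if version == 42:      # classic TIFF: 32-bit offset to IFD0 follows
            offset, = struct.unpack(endian + "I", f.read(4))
            return "TIFF", offset
        if version == 43:      # BigTIFF: offset size, padding, then 64-bit offset
            offset_size, padding = struct.unpack(endian + "HH", f.read(4))
            assert offset_size == 8 and padding == 0
            offset, = struct.unpack(endian + "Q", f.read(8))
            return "BigTIFF", offset
        raise ValueError("unknown TIFF version: %d" % version)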

Correct way to store a bit array

I'm working on a project that needs to store something like
101110101010100011010101001
into the database. It's not a file or archive: it's only a bit array, and I think that storing it in a varchar column is a waste of space/performance.
I've looked at the BLOB and VARBINARY types, but both of them allow inserting a value like 54563423523515453453, which is not exactly a bit array.
If I store a bit array like 10001000 as text in a BLOB/varbinary/varchar column, it will consume more than one byte, and I want to consume the minimum space: eight bits should take only one byte, 16 bits two bytes, and so on.
If it's not possible, then what is the best approach to waste the minimum amount of space in this case?
Important notes: The size of the array is variable, and is not divisible by eight in every situation. Sometimes I will need to store 325 bits, other times 7143 bits....
In one of my previous projects, I converted streams of 1's and 0's to decimal, but they were shorter. I don't know if that would be applicable in your project.
On the other hand, IMHO, you should clarify what you will need to do with that data once it is stored. Search? Compare? It might largely depend on the purpose of the database.
Could you gzip it and then store it? Is that applicable?
Binary is a string representation of a number. The string
101110101010100011010101001
represents the number
... + 1*2^5 + 0*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0
As such, it can be stored in a 32-bit integer if it is converted from a binary string to the number it represents. In Perl, one would use
oct('0b'.$binary)
But you have a variable number of bits. Not a problem! Just process them 8 at a time to create a string of bytes to place in a BLOB or similar.
Ah, but there's a catch. You'll need to add padding to get a number divisible by 8, which means you'll have to use a means of removing that padding. A simple approach if there's a known maximum length is to use a length prefix. e.g. If you know the number of bits is never going to exceed 65,535, encode the number of bits in the first two bytes of the string.
pack('nB*', length($binary), $binary)
which is reverted using
my ($length, $binary) = unpack('nB*', $packed);
substr($binary, $length) = '';
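For readers who don't use Perl, here is an equivalent Python sketch of the same idea, a 16-bit big-endian length prefix followed by the padded bit string (pack_bits and unpack_bits are illustrative names, not library functions):

import struct

def pack_bits(bitstring):
    nbits = len(bitstring)
    padded = bitstring + "0" * (-nbits % 8)          # pad up to a byte boundary
    payload = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return struct.pack(">H", nbits) + payload        # 16-bit length prefix, then the bytes

def unpack_bits(blob):
    nbits, = struct.unpack(">H", blob[:2])
    bits = "".join(format(b, "08b") for b in blob[2:])
    return bits[:nbits]                              # strip the padding again

As in the Perl version, the two-byte prefix limits the array to 65,535 bits; a longer prefix would lift that limit.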

Why is it useful to know how to convert between numeric bases?

We are learning about converting Binary to Decimal (and vice-versa) as well as other base-conversion methods, but I don't understand the necessity of this knowledge.
Are there any real-world uses for converting numbers between different bases?
When dealing with Unicode escape codes: '\u2014' in Javascript is the em dash character (&mdash;) in HTML
When debugging: many debuggers show all numbers in hex
When writing bitmasks: it's more convenient to specify powers of two in hex (or by writing 1 << 4)
In this article I describe a concrete use case. In short, suppose you have a series of bytes you want to transfer using some transport mechanism, but you cannot simply pass the payload as bytes, because you are not able to send binary content. Let's say you can only use 64 characters for encoding the payload. A solution to this problem is to convert the bytes (8-bit characters) into 6-bit characters. Here the number conversion comes into play. Consider the series of bytes as a big number whose base is 256. Then convert it into a number with base 64 and you are done. Each digit of the new base 64 number now denotes a character of your encoded payload...
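As a rough Python sketch of that idea (this is a plain base conversion; real Base64 also specifies 3-byte grouping and '=' padding, which are ignored here, and the function name is made up):

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def to_base64_digits(payload):
    n = int.from_bytes(payload, "big")     # the byte string, read as one base-256 number
    digits = ""
    while n:
        n, d = divmod(n, 64)               # repeated division yields base-64 digits
        digits = ALPHABET[d] + digits
    return digits or ALPHABET[0]

print(to_base64_digits(b"Hi!"))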
If you have a device, such as a hard drive, that can only have a set number of states, you can only count in a number system with that many states.
Because a computer's bits have only two states, on and off, each digit can represent only 0 or 1. Therefore a base-2 system is used.
If you have a device that has 3 states, you could represent 0, 1 and 2, and therefore count in a base-3 system.

Compressing a binary matrix

We were asked to find a way to compress a square binary matrix as much as possible, and if possible, to add redundancy bits to check and maybe correct errors.
The redundancy part is easy to implement, in my opinion. The complicated part is compressing the matrix. I thought about using run-length encoding after reshaping the matrix into a vector, because there will be more zeros than ones, but I only achieved about 40 bits of compression (we are working on small sizes), although I thought it would do better.
Also, after run-length encoding, one idea was to Huffman-code the matrix, but a dictionary must be sent in order to recover the original information.
I'd like to know what would be the best way to compress a binary matrix?
After reading some comments: yes @Adam, you're right, the 14x14 matrix should be compressed into 128 bits, so if I only use the coordinates (rows & cols) for each non-zero element, it would still be 160 bits (since there are twenty ones). I'm not looking for an exact solution but for a useful idea.
You can only talk about compressing something if you have a distribution and a representation. That's the issue of the dictionary you have to send along: you always need some sort of dictionary or protocol to uncompress something. It just so happens that things like .zip and .mpeg already have those dictionaries/codecs. Even something as simple as Huffman encoding is an algorithm; on the other side of the communication channel (you can think of compression as communication), the other person already has a bit of code (the dictionary) to perform the Huffman decompression scheme.
Thus you cannot even begin to talk about compressing something without first thinking "what kinds of matrices do I expect to see?", "is the data truly random, or is there order?", and if so "how can I represent the matrices to take advantage of order in the data?".
You cannot compress some matrices without increasing the size of other objects (by at least 1 bit). This is bad news if all matrices are equally probable, and you care equally about them all.
Addenda:
The suggestion to use sparse-matrix machinery is not necessarily the right answer. The matrix could, for example, be represented in Python as [[(r+c)%2 for c in range(cols)] for r in range(rows)] (a checkerboard pattern), and a sparse matrix wouldn't compress it at all, but the Kolmogorov complexity of the matrix is the above program's length.
Well, I know every matrix will have the same number of ones, so this is kind of deterministic. The only thing I don't know is where the 1's will be. Also, if I transmit the matrix with a dictionary and there are burst errors, maybe the dictionary gets affected, so... wouldn't the resulting information be corrupted? That's why I was trying to use lossless data compression such as run-length encoding; the decoder just doesn't need a dictionary. --original poster
How many 1s does the matrix have as a fraction of its size, and what is its size (NxN -- what is N)?
Furthermore, this is an incorrect assertion and should not be used as a reason to desire run-length encoding (which still requires a program); when you transmit data over a channel, you can always add error-correction to this data. "Data" is just a blob of bits. You can transmit both the data and any required dictionaries over the channel. The error-correcting machinery does not care at all what the bits you transmit are for.
Addendum 2:
There are (14*14) choose 20 possible arrangements, which I assume are chosen uniformly at random. If this number were larger than 2^128, what you're trying to do would be impossible. Fortunately log_2((14*14) choose 20) ~= 90 bits < 128 bits, so it's possible.
The simple solution of writing down 20 numbers like 32,2,67,175,52,...,168 won't work because log_2(14*14)*20 ~= 153 bits > 128 bits. This would be equivalent to run-length encoding. We want to do something like this, but we are on a very strict budget and cannot afford to be "wasteful" with bits.
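Those bit counts are easy to check; a quick Python verification using only the standard library:

import math

n, k = 14 * 14, 20
print(math.log2(math.comb(n, k)))   # ~89.8 bits, inside the 128-bit budget
print(k * math.log2(n))             # ~152.3 bits, so the naive 20-coordinate encoding is over budget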
Because you care about each possibility equally, your "dictionary"/"program" will simulate a giant lookup table. Matlab's sparse matrix implementation may work but is not guaranteed to work and is thus not a correct solution.
If you can create a bijection between the number range [0,2^128) and subsets of size 20, you're good to go. This corresponds to enumerating ways to descend the pyramid in http://en.wikipedia.org/wiki/Binomial_coefficient to the 20th element of row 196. This is the same as enumerating all "k-combinations". See http://en.wikipedia.org/wiki/Combination#Enumerating_k-combinations
Fortunately I know that Mathematica and Sage and other CAS software can apparently generate the "5th" or "12th" or arbitrarily numbered k-subset. Looking through their documentation, we come upon a function called "rank", e.g. http://www.sagemath.org/doc/reference/sage/combinat/subset.html
So then we do some more searching, and come across some arcane Fortran code like http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_rank.m and http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_unrank.m
We could reverse-engineer it, but it's kind of dense. But now we have enough information to search for "k-subset rank unrank", which leads us to http://www.site.uottawa.ca/~lucia/courses/5165-09/GenCombObj.pdf -- see the section "Generating k-subsets (of an n-set): Lexicographical Ordering" and the rank and unrank algorithms on the next few pages.
In order to achieve the exact theoretically optimal compression, in the case of a uniformly random distribution of 1s, we must thus use this technique to biject our matrices to our output number of range <2^128. It just so happens that combinations have a natural ordering, known as ranking and unranking of combinations. You assign a number to each combination (ranking), and if you know the number you automatically know the combination (unranking). Googling k-subset rank unrank will probably yield other algorithms.
Thus your solution would look like this:
serialize the matrix into a list
e.g. [[0,0,1],[0,1,1],[1,0,0]] -> [0,0,1,0,1,1,1,0,0]
take the indices of the 1s:
e.g. [0,0,1,0,1,1,1,0,0] -> [3,5,6,7] (positions counted 1 2 3 4 5 6 7 8 9), i.e. a k=4-subset of an n=9 set
take the rank
e.g. compressed = rank([3,5,6,7], n=9)
compressed==412 (or something, I made that up)
you're done!
e.g. 412 -binary-> 110011100 (at most n=9 bits, less than 2^n = 2^9 = 512)
to uncompress, unrank it (see the sketch below)
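Here is a small Python sketch of lexicographic ranking and unranking of k-combinations (my own code, not the pseudocode from the course notes above), using the 1-based indices from the example:

from math import comb

def rank(combo, n):
    # lexicographic rank (0-based) of a sorted k-combination of {1, ..., n}
    k, r, prev = len(combo), 0, 0
    for i, c in enumerate(combo):
        for skipped in range(prev + 1, c):       # combinations with a smaller element at this position
            r += comb(n - skipped, k - i - 1)
        prev = c
    return r

def unrank(r, n, k):
    # inverse of rank(): recover the k-combination from its rank
    combo, c = [], 1
    while k > 0:
        block = comb(n - c, k - 1)               # combinations whose next element is c
        if r < block:
            combo.append(c)
            k -= 1
        else:
            r -= block
        c += 1
    return combo

assert unrank(rank([3, 5, 6, 7], 9), 9, 4) == [3, 5, 6, 7]

For the 14x14 problem you would call rank(indices, 196) on the 20 one-based positions and write the result as a number of at most 90 bits.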
I'll get to 128 bits in a second; first, here's how you fit a 14x14 boolean matrix with exactly 20 nonzeros into 136 bits. It's based on the CSC sparse matrix format.
You have an array c with 14 4-bit counters that tell you how many nonzeros are in each column.
You have another array r with 20 4-bit row indices.
56 bits (c) + 80 bits (r) = 136 bits.
Let's squeeze 8 bits out of c:
Instead of 4-bit counters, use 2-bit ones. c is now 2*14 = 28 bits, but can't support more than 3 nonzeros per column. This leaves us with 128-80-28 = 20 bits. Use that space for an array a4c of five 4-bit elements, each meaning "add 4 to the element of c selected by this 4-bit value". So, if a4c={2,2,10,15,15}, that means c[2] += 4; c[2] += 4 (again); c[10] += 4.
The "most wasteful" distribution of nonzeros is one where the column count will require an add-4 to support 1 extra nonzero: so 5 columns with 4 nonzeros each. Luckily we have exactly 5 add-4s available.
Total space = 28 bits (c) + 20 bits (a4c) + 80 bits (r) = 128 bits.
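If it helps, here is a rough Python sketch of that packing (my own reading of the scheme; it assumes the value 15 marks an unused add-4 slot, since 15 is not a valid column index, and it leaves the decoder as an exercise):

def encode_128bits(matrix):
    # matrix: 14 rows of 14 ints (0/1) with exactly 20 ones
    counts = [sum(col) for col in zip(*matrix)]                       # nonzeros per column
    rows = [r for c in range(14) for r in range(14) if matrix[r][c]]  # row indices, column-major
    a4c = [c for c in range(14) for _ in range(counts[c] // 4)]       # one entry per needed "add 4"
    a4c += [15] * (5 - len(a4c))                                      # pad unused slots with 15
    bits = "".join(format(n % 4, "02b") for n in counts)              # c:   14 x 2 = 28 bits
    bits += "".join(format(x, "04b") for x in a4c)                    # a4c:  5 x 4 = 20 bits
    bits += "".join(format(r, "04b") for r in rows)                   # r:   20 x 4 = 80 bits
    return bits                                                       # 128 '0'/'1' characters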
Your input is a perfect candidate for a sparse matrix. You said you're using Matlab, so you already have a good sparse matrix built for you.
spm = sparse(dense_matrix)
Matlab's sparse matrix implementation uses Compressed Sparse Columns, which has memory usage on the order of 2*(# of nonzeros) + (# of columns), which should be pretty good in your case of 20 nonzeros and 14 columns. Storing 20 values sure is better than storing 196...
Also remember that all matrices in Matlab are going to be composed of doubles. Just because your matrix can be stored as a 1-bit boolean doesn't mean Matlab won't stick it into a 64-bit floating point value... If you do need it as a boolean you're going to have to make your own type in C and use .mex files to interface with Matlab.
After thinking about this again, if all your matrices are going to be this small and they're all binary, then just store them as a binary vector (bitmask). Going off your 14x14 example, that requires 196 bits or 25 bytes (plus n, m if your dimensions are not constant). That same vector in Matlab would use 64 bits per element, or 1568 bytes. So storing the matrix as a bitmask takes as much space as 4 elements of the original matrix in Matlab, for a compression ratio of 62x.
Unfortunately I don't know if Matlab supports bitmasks natively or if you have to resort to .mex files. If you do get into C++ you can use STL's vector<bool> which implements a bitmask for you.
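In case it's useful, here is a small Python sketch of the bitmask idea itself (the answer above is about Matlab and C++; matrix_to_bitmask is just an illustrative name):

def matrix_to_bitmask(matrix):
    # flatten the 0/1 matrix row-major and pack eight bits per byte
    flat = [bit for row in matrix for bit in row]
    flat += [0] * (-len(flat) % 8)               # pad the tail to a whole byte
    return bytes(int("".join(map(str, flat[i:i + 8])), 2)
                 for i in range(0, len(flat), 8))

A 14x14 matrix flattens to 196 bits and packs into 25 bytes, matching the figure above.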

What transformations are used by little-endian systems to convert data to network order?

What are the underlying transformations necessary to convert data on a little-endian system into network byte order? For 2-byte and 4-byte data there are well-known functions (such as htons, ntohl, etc.) to encapsulate the changes; what happens to strings of 1-byte data (if anything)?
Also, Wikipedia implies that little-endian is the mirror image of big-endian, but if that were true why would we need specific handling for 2 and 4 byte data?
The essay "On Holy Wars and a Plea for Peace" seems to imply that there are many different flavors of little-endian -- it's an old essay -- does that still apply? Are byte order markers like the ones found at the beginning of Java class files still necessary?
And finally, is 4-byte alignment necessary for network-byte order?
Let's say you have the ASCII text "BigE" in an array b of bytes.
b[0] == 'B'
b[1] == 'i'
b[2] == 'g'
b[3] == 'E'
This is network order for the string as well.
If it was treated as a 32 bit integer, it would be
'B' + ('i' << 8) + ('g' << 16) + ('E' << 24)
on a little endian platform and
'E' + ('g' << 8) + ('i' << 16) + ('B' << 24)
on a big endian platform.
If you convert each 16-bit word separately, you'd get neither of these:
'i' + ('B' << 8) + ('E' << 16) + ('g' << 24)
which is why ntohl and ntohs are both required.
In other words, ntohs swaps bytes within a 16-bit short, and ntohl reverses the order of the four bytes of its 32-bit word.
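For readers who want to poke at this, here is a small Python sketch (the first comment assumes a typical little-endian host):

import socket, struct

x = 0x0A0B0C0D
# On a little-endian host this prints 0xd0c0b0a (all four bytes reversed);
# on a big-endian host the value is returned unchanged.
print(hex(socket.htonl(x)))

# struct expresses the same idea: "!" means network (big-endian) byte order,
# so the packed bytes are 0a 0b 0c 0d regardless of the host's own order.
print(struct.pack("!I", x).hex())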
The dedicated functions for 2- and 4-byte data take advantage of processor instructions that operate on those specific sizes. Swapping bytes one at a time is certainly less efficient than using wider instructions to perform the same operation on all two or four bytes at once.
1-byte data doesn't require any conversion between byte orders (which is an advantage of UTF-8 over UTF-16 and UTF-32 for string encoding).
is 4-byte alignment necessary for network-byte order?
No specific alignment is necessary for bytes going over a network. Your processor may demand a certain alignment in memory, but it's up to you to resolve the discrepancy. The x86 family usually doesn't make such demands.
The basic idea is that all multi-byte types have to have the order of their bytes reversed. A four byte integer would have bytes 0 and 3 swapped, and bytes 1 and 2 swapped. A two byte integer would have bytes 0 and 1 swapped. A one byte character does not get swapped.
There are two very important implications of this that non-practitioners and novices don't always realise:
(ASCII) Character strings are not touched.
There is no possible blind algorithm to byte swap generic "data". You have to know the type of all your data and swap each item in the manner required for its type.