Compressing 5 Bytes into 4 Bytes - binary

I am trying to understand more about BCD, packed/unpacked byte arrays and byte shifting. In this picture I made an example of the instructions I am unsure about: an array of 5 bytes is compressed (packed) into an array of 4 bytes, and vice versa. It converts 0x93, 0x87, 0xE3, 0xC3, 0x17 into 0xBC, 0x3C, 0x61, 0xD3. Can anyone tell me whether there is a general term for the method being used, or whether there is decent documentation somewhere about this or a similar method?
Many thanks

Related

MySQL character set for numbers compression

I would like to store many numbers in 1 cell and save space. The idea is to encode each one as a string of constant length and store them in a text field (presumably MEDIUMTEXT). Which characters can be used so that each takes only 1 byte? I assume that special characters are stored in a way that uses more than 1 byte. I can use e.g. base64, but I am not sure how many encoding characters I can add to the base before MySQL actually uses more space to store them than I manage to save.
You say "numbers". What do you mean, really?
Digits? See below.
Integers? (no decimal point, no fraction)
Floats? (with exponent, etc)
Some notes on digits, compression, etc:
1 byte per Ascii character -- 8 bits
1 byte per digit, since it is an ascii character
One random digit, at maximum theoretical compression, is about 3.32 bits (log2 of 10), i.e. roughly 0.415 bytes. Visualize it this way: 1000 is 3 digits and 1024 is 10 bits.
MySQL's DECIMAL datatype puts 2 digits into one byte for smaller numbers; for larger numbers it stores 9 digits in 4 bytes.
If you zip up a million digits of pi, it will come very close to the above compression.
A simple Rule of Thumb is that "text" compresses 3:1.
Base64 expands bytes by 8/6 (about 33%) because each output character occupies an 8-bit byte but carries only 6 bits, being one of 64 (2^6) different characters.
Base64 is more useful for avoiding special characters; it is not really a compression technique.
A 4-byte MySQL INT (range of -2 billion to +2 billion, but usually just positive and not evenly distributed), when converted to base64, would take at least 6 bytes (32 bits at 6 bits per character, rounded up) for 9-10 digits.
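The arithmetic in the notes above can be checked with a short sketch. This is only an illustration: the helper name base64_length and the sample value 2000000000 are assumptions made for the example, not part of the original answer.

```python
import base64
import math
import struct

# Information content of one uniformly random decimal digit.
bits_per_digit = math.log2(10)
bytes_per_digit = bits_per_digit / 8


def base64_length(raw: bytes) -> int:
    """Count the Base64 characters needed for raw, ignoring '=' padding."""
    return len(base64.b64encode(raw).rstrip(b"="))


# A 4-byte unsigned integer; the value is an arbitrary illustrative choice.
packed_int = struct.pack(">I", 2_000_000_000)

print(f"{bits_per_digit:.3f} bits per digit, {bytes_per_digit:.3f} bytes per digit")
print(f"{len(packed_int)} raw bytes become {base64_length(packed_int)} Base64 characters")
```

The binary form stays at 4 bytes, while the Base64 form of the same 4 bytes is longer, which is why Base64 is a transport encoding rather than a compression technique.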
General techniques
Client compression: For 123,2345,88,22, here is one way to handle it. In fact, I recommend this for virtually any text handling where compression is desired in MySQL.
use compress() (or similar function) in your client.
use BLOB (up to 64KB) or MEDIUMBLOB (up to 16MB) in the table
use uncompress() after retrieving the blob.
For an array of numbers, use json_encode for the array, then feed it into compress+blob, above. It will work for any sized "numbers" and provide nearly maximal compression.
You cannot efficiently reach into a MEDIUMTEXT or BLOB to get one number out of an array. The entire cell will be fetched.
That leads to another general statement... If you have a lot of stuff that you don't need to sort on, nor fetch individually, JSON is a good approach. From MySQL's point of view, think of it as an opaque blob. The application writes and rereads it as one big thing, then picks it apart.
The JSON will possibly encode the above example as ["123","2345","88","22"], which will be slightly fatter after compression. But, any good compression algorithm will notice and take advantage of the repetition.
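A minimal sketch of the client-side approach, written in Python rather than the PHP-style json_encode/compress() named above. The function names pack_numbers and unpack_numbers are hypothetical, and zlib stands in for whatever compression function your client offers.

```python
import json
import zlib


def pack_numbers(numbers):
    """Serialize a list of numbers to compact JSON, then compress it."""
    text = json.dumps(numbers, separators=(",", ":"))
    return zlib.compress(text.encode("ascii"), 9)


def unpack_numbers(blob):
    """Reverse pack_numbers: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("ascii"))


numbers = [123, 2345, 88, 22]
blob = pack_numbers(numbers)          # store this in a BLOB / MEDIUMBLOB column
restored = unpack_numbers(blob)       # run this after fetching the column
print(len(blob), restored)
```

For a list this small the compressed blob can be larger than the JSON text because of the fixed zlib overhead; the gain only shows up with longer arrays.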
Take advantage of the data
17,22738 48,77795 300
17,22792 48,77795 297
17,22853 48,77764 294
17,22874 48,77743 297
17,22887 48,77704 300
17,22968 48,77671 305
17,23069 48,77563 296
17,23092 48,77561 292
-->
17,22738 48,77795 300
54 0 -3
61 -31 -3
21 -21 3
13 -39 3
81 -33 5
101 -108 -9
23 -2 -4
The numbers stay relatively constant. Take advantage of it by starting with raw data, but then switching to deltas. Try it with about 10 times as much data; I suspect you will continue to get better than 2x compression before zipping, but maybe slightly less than 2x after zipping. (Zipping can take advantage of the repetition of 48,777; I am taking more advantage of it by tossing most of it.)
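The delta step can be sketched as follows. As an assumption for the example, the decimal-comma values are treated as integers scaled by 100000 (17,22738 becomes 1722738); the names ROWS, delta_encode and delta_decode are hypothetical.

```python
# Sample rows from above, with the decimal comma dropped (assumed scaling).
ROWS = [
    (1722738, 4877795, 300),
    (1722792, 4877795, 297),
    (1722853, 4877764, 294),
    (1722874, 4877743, 297),
    (1722887, 4877704, 300),
    (1722968, 4877671, 305),
    (1723069, 4877563, 296),
    (1723092, 4877561, 292),
]


def delta_encode(rows):
    """Keep the first row raw, then store each row as a difference from the previous one."""
    encoded = [rows[0]]
    for prev, cur in zip(rows, rows[1:]):
        encoded.append(tuple(c - p for p, c in zip(prev, cur)))
    return encoded


def delta_decode(encoded):
    """Rebuild the original rows by accumulating the differences."""
    rows = [encoded[0]]
    for delta in encoded[1:]:
        rows.append(tuple(p + d for p, d in zip(rows[-1], delta)))
    return rows


for row in delta_encode(ROWS):
    print(*row)
```

Decoding is a running sum, so the encoding is lossless as long as the first row is kept in full.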

Bit numbering in binary data with little endianness

More formally, the 16 bytes of plaintext p0 ,..., p15 are first split into 4 words P0,...,P3 of 32 bits each using the little-endian convention.
In the above sentence, suppose the plaintext is, for example, this_is__awesome, which contains 16 characters. I wonder which end is p0 and, after splitting the plaintext into the 4 words this, _is_, _awe, some, which of these is P0?
Considering the term 'little-endianness', 0x1234 would be stored as 0x34 0x12. So I wonder whether this would become [some, _awe, _is_, this] if we apply little-endianness to the words, or [emos, ewa_, _si_, siht] if we apply little-endianness at the byte level.
And finally, what would each Pi become for the plaintext this_is__awesome?
Please help me out with this.
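One way to explore the question is the sketch below. It assumes the usual reading of the convention, which the quoted sentence does not spell out: p0 is the first byte of the input, Pi is built from bytes p(4i) to p(4i+3), and the lowest-numbered byte of each group is the least significant byte of the word.

```python
import struct

plaintext = b"this_is__awesome"  # p0 = 't', p15 = 'e'

# "<4I" means: four unsigned 32-bit integers, each read in little-endian byte order.
words = struct.unpack("<4I", plaintext)

for i, word in enumerate(words):
    chunk = plaintext[4 * i:4 * i + 4]
    print(f"P{i} = 0x{word:08x}  from bytes {chunk!r}")
```

Under that assumption the word order is unchanged (P0 comes from "this", P3 from "some"); only the significance of the bytes inside each word is affected, so 't' (0x74) ends up as the low byte of P0.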

About MIPS lb and sb and endianness

I just read a comment by #Cheshar in this answer - Loading and storing bytes in MIPS.
This is my reasoning regarding his first point: the value in $t0 should be 0xFFFFFF90 (i.e. it's sign-extended) but this won't change the result of mem(4) (I think this means read the word started at 0x04) which is still FFFF90FF. Am I correct?
But I'm not sure about his second point:
"lb and sb doesn't care for endianness."
I'm thinking about why
the changing from big endian to little endian is
byte: 0 1 2 3 ----\ 3 2 1 0
00 90 12 A0 ----/ 00 90 12 A0
so it seems like individual byte is still read like big endian?
lb/sb do not care about endianness. There is no endianness for a single byte.
It only matters if you store a big/little endian (e.g. 4 byte) number and then try to access it byte-by-byte.
The byte offsets do not change, so a better diagram might be:
byte: 0 1 2 3 ----\ 0 1 2 3
00 90 12 A0 ----/ A0 12 90 00
If $t1 points to your stored integer, when you do:
lb $t0,1($t1)
With the big-endian layout on the left of the diagram and the little-endian layout on the right, you get 90 for big endian and 12 for little endian.
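The byte layouts in the diagram can be reproduced with a small sketch. The value 0x009012A0 is an assumption read off the diagram, and load_byte is a hypothetical helper that mimics what lb does (fetch one byte, sign-extend it to 32 bits).

```python
import struct

VALUE = 0x009012A0  # assumed 32-bit integer behind the diagram


def load_byte(memory: bytes, offset: int) -> int:
    """Mimic MIPS lb: read one byte and sign-extend it to a 32-bit register value."""
    byte = memory[offset]
    if byte & 0x80:
        return byte | 0xFFFFFF00
    return byte


big = struct.pack(">I", VALUE)     # bytes at offsets 0..3 on a big-endian machine
little = struct.pack("<I", VALUE)  # bytes at offsets 0..3 on a little-endian machine

print("big endian   :", big.hex(" "), "-> offset 1 gives", hex(load_byte(big, 1)))
print("little endian:", little.hex(" "), "-> offset 1 gives", hex(load_byte(little, 1)))
```

The offsets never change; only which byte of the integer sits at each offset does. The 0x90 byte also comes back sign-extended as 0xFFFFFF90, which matches the reasoning about $t0 in the question.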
UPDATE:
I upvote your answer since it's clean. But didn't you think that is counter intuitive before? Since in little endian the 32-bit integer has no meaning when read 32-bit all together from either left to right or right to left...?
Once the data is in a register (via lw), we visualize and operate on it as big endian (i.e. a left shift by 1 is a multiply by 2).
The decimal value 123 is "big endian" (one hundred + twenty + three).
Little endian is merely the byte order when we fetch from or store into memory. The hardware will shuffle the bytes as needed.
The advantage of little endian is that it works better for large multiprecision numbers (e.g. libgmp).
And, when Intel first came out with the 8 bit 8080 processor (with only a single byte memory bus), little endian made things faster. For example, when doing an add, after fetching the LSB at offset 0, it could do the add of the two LSB bytes in parallel with the fetch of MSB at offset 1.
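A rough sketch of why least-significant-byte-first storage suits multiprecision addition: the carry propagates in the same direction as increasing memory offsets, so the sum can be produced in one forward pass. The function name add_little_endian is hypothetical, and this is not how libgmp is actually implemented.

```python
def add_little_endian(a: bytes, b: bytes) -> bytes:
    """Add two unsigned integers stored least significant byte first."""
    result = bytearray()
    carry = 0
    for i in range(max(len(a), len(b))):
        byte_a = a[i] if i < len(a) else 0
        byte_b = b[i] if i < len(b) else 0
        total = byte_a + byte_b + carry
        result.append(total & 0xFF)  # low 8 bits stay at this offset
        carry = total >> 8           # overflow moves to the next offset
    if carry:
        result.append(carry)
    return bytes(result)


x = (0x12FF).to_bytes(2, "little")
y = (0x0001).to_bytes(2, "little")
print(add_little_endian(x, y).hex(" "))
```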
To give an example: 8-bit (unsigned) integer b00100001 is 33(decimal), but with little endian it is stored as b00010010, which is 18(decimal) when read from left to right, and b01001000, which is 64+8=72(decimal) when read from right to left, bit by bit.
While it is possible for a [theoretical] computer architecture to behave as you describe, no modern one [that I'm aware of] does. That's partly because doing so requires more complex circuitry.
However, I once wrote a multiprecision math package that did use little endian bytes and little endian bits within the bytes. But, it was slow. This is sometimes useful for large bit vectors (e.g. 500,000 bits wide).
Or my idea is completely wrong since computer can only see byte as an abstraction of underlying bits.
The endianness of bits in a byte is the same (big endian), regardless of whether the byte is in a register or in a memory cell.
The different endianness only pertains to multibyte integers (e.g. in C, int or short).

Deflate length of 258 double encoding

In Deflate algorithm there are two ways to encode a length of 258:
Code 284 + 5 extra bits of all 1's
Code 285 + 0 extra bits;
At first glance, this is not optimal, because proper use of code 285 would allow a length of 259 to be encoded.
Is this duality a specification mistake that was not fixed for compatibility reasons, or are there arguments for it, for example that a length of 258 must be encoded with a shorter code (0 extra bits) for some reason?
We may never know. The developer of the deflate format, Phil Katz, passed away many years ago at a young age.
My theory is that a match length was limited to 258 so that a match length in the range 3..258 could fit in a byte, encoded as 0..255. This format was developed around 1990, when this might make a difference in an assembler implementation.
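The arithmetic behind the two encodings can be sketched like this. The table is generated from the base lengths and extra-bit counts of the Deflate length codes (257 to 285); build_length_table and encodings are hypothetical helper names, and the sketch only enumerates what the bit fields can express, not what any particular encoder or decoder accepts.

```python
def build_length_table():
    """Map each length code to (base length, number of extra bits)."""
    table = {}
    base = 3
    for code in range(257, 285):
        extra_bits = 0 if code < 265 else (code - 261) // 4
        table[code] = (base, extra_bits)
        base += 1 << extra_bits  # next code starts after this code's range
    table[285] = (258, 0)        # special case: length 258 with no extra bits
    return table


def encodings(length, table):
    """List every (code, extra value) pair whose bit fields can express length."""
    found = []
    for code, (base, extra_bits) in sorted(table.items()):
        offset = length - base
        if 0 <= offset < (1 << extra_bits):
            found.append((code, offset))
    return found


table = build_length_table()
print(table[284], table[285])
print(encodings(258, table))
print(258 - 3)  # largest match length minus the minimum, as a byte value
```

Every length from 3 to 257 has exactly one representation, 258 has two, and 259 has none, which is consistent with the theory that lengths were meant to fit in 0..255 after subtracting 3.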
Adding a second answer here to underscore Mark's guess that allowing the length to be encoded in a byte is helpful to assembler implementations. At the time 8086 level assembler was still common and using the 8 bit form of registers gave you more of them to work with than using them in 16 bit size.
The benefit is even more pronounced on 8 bit processors such as the 6502. It starts with the length decoding. Symbols 257 .. 264 represent a match length of 3 .. 10 respectively. If you take the low byte of those symbols (1 .. 8) you get exactly 2 less than the match length.
A more complicated yet fairly easy to compute formula gives 2 less than the match length of symbols 265 through 284. 2 less than the match length of symbol 285 is 256. That doesn't fit in a byte but we can store 0 which turns out to be equivalent.
zlib6502 uses this for considerable advantage. It calculates the match length in inflateCodes_lengthMinus2. And once the back pointer into the window has been determined it copies the data like so:
jsr copyByte
jsr copyByte
inflateCodes_copyByte
jsr copyByte
dec inflateCodes_lengthMinus2
bne inflateCodes_copyByte
It makes two explicit calls to copy a byte and then loops over the length less 2, which works as you would expect for stored values of 1 to 255. For a stored value of 0 it will actually iterate 256 times, as we desire: the first time through the loop the 0 is decremented to 255, which is non-zero, so the loop continues 255 more times for a total of 256.
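The wrap-around behaviour of that loop can be modelled in a few lines. This is a simulation of the control flow only, with a hypothetical function name; it is not a translation of zlib6502.

```python
def copies_made(length_minus2: int) -> int:
    """Count the copyByte calls for an 8-bit counter holding (match length - 2)."""
    counter = length_minus2 & 0xFF  # only the low byte is stored
    copied = 2                      # the two explicit jsr copyByte calls
    while True:
        copied += 1                       # jsr copyByte inside the loop
        counter = (counter - 1) & 0xFF    # dec wraps from 0 to 255
        if counter == 0:                  # bne falls through when the result is zero
            break
    return copied


for match_length in (3, 10, 257, 258):
    print(match_length, copies_made(match_length - 2))
```

Because the decrement happens before the test, a stored 0 behaves as 256, so every match length from 3 to 258 is copied correctly with a single byte of state.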
I'd have to think that Phil Katz understood intuitively if not explicitly the benefits of keeping the length of matches within 8 bits.

Why is it useful to know how to convert between numeric bases?

We are learning about converting Binary to Decimal (and vice-versa) as well as other base-conversion methods, but I don't understand the necessity of this knowledge.
Are there any real-world uses for converting numbers between different bases?
When dealing with Unicode escape codes: '\u2014' in JavaScript is &#8212; in HTML
When debugging: many debuggers show all numbers in hex
When writing bitmasks: it's more convenient to specify powers of two in hex (or by writing 1 << 4)
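A quick sketch of two of these cases in Python; the variable names are only illustrative. The escape '\u2014' uses hexadecimal digits, whereas an HTML numeric entity of the form &#...; uses decimal, so moving between them is a base conversion.

```python
# Hex digits from the '\u2014' escape, converted to the decimal form HTML entities use.
code_point = int("2014", 16)
html_entity = f"&#{code_point};"

# A bitmask for bit 4, written as a shift and shown in hex and binary.
mask = 1 << 4

print(code_point, html_entity, chr(code_point) == "\u2014")
print(hex(mask), bin(mask), mask)
```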
In this article I describe a concrete use case. In short, suppose you have a series of bytes you want to transfer using some transport mechanism, but you cannot simply pass the payload as bytes, because you are not able to send binary content. Let's say you can only use 64 characters for encoding the payload. A solution to this problem is to convert the bytes (8-bit characters) into 6-bit characters. Here the number conversion comes into play. Consider the series of bytes as a big number whose base is 256. Then convert it into a number with base 64 and you are done. Each digit of the new base 64 number now denotes a character of your encoded payload...
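The idea of treating the payload as one big base-256 number can be sketched as below. The function name is hypothetical and the alphabet is assumed to be the standard Base64 one. Note that this literal "big number" conversion matches standard Base64 only when the payload length is a multiple of 3 bytes; real Base64 pads short inputs differently.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def bytes_to_base64_digits(payload: bytes) -> str:
    """Read payload as a base-256 number and rewrite it with base-64 digits."""
    number = int.from_bytes(payload, "big")
    # Enough 6-bit digits to hold all the input bits (keeps leading zero digits).
    digit_count = (len(payload) * 8 + 5) // 6
    digits = []
    for _ in range(digit_count):
        number, digit = divmod(number, 64)
        digits.append(ALPHABET[digit])
    return "".join(reversed(digits))


print(bytes_to_base64_digits(b"Man"))
print(base64.b64encode(b"Man").decode("ascii"))
```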
If you have a device, such as a hard drive, that can only have a set number of states, you can only count in a number system with that many states.
Because a computer's bit has only two states, on and off, it can only represent 0 and 1. Therefore a base 2 system is used.
If you have a device that had 3 states, you could represent 0, 1 and 2, and therefore count in a base 3 system.