Deflate length of 258 double encoding - language-agnostic

In the Deflate algorithm there are two ways to encode a match length of 258:
Code 284 + 5 extra bits of all 1's;
Code 285 + 0 extra bits.
At first glance this is not optimal, because proper use of code 285 would allow a length of 259 to be encoded.
Is this duality a specification mistake that was left unfixed for compatibility reasons, or is there some argument for it - for example, that a length of 258 must be encodable with the shorter form (0 extra bits) for some reason?

We may never know. The developer of the deflate format, Phil Katz, passed away many years ago at a young age.
My theory is that a match length was limited to 258 so that a match length in the range 3..258 could fit in a byte, encoded as 0..255. This format was developed around 1990, when this might make a difference in an assembler implementation.

Adding a second answer here to underscore Mark's guess that allowing the length to be encoded in a byte is helpful to assembler implementations. At the time, 8086-level assembler was still common, and using the 8-bit form of the registers gave you more of them to work with than using them at 16-bit size.
The benefit is even more pronounced on 8-bit processors such as the 6502. It starts with the length decoding. Symbols 257 .. 264 represent match lengths of 3 .. 10 respectively. If you take the low byte of those symbols (1 .. 8) you get exactly 2 less than the match length.
A more complicated yet fairly easy to compute formula gives 2 less than the match length for symbols 265 through 284. For symbol 285, 2 less than the match length is 256. That doesn't fit in a byte, but we can store 0, which turns out to be equivalent.
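For reference, here is a small sketch (Python; the table and function names are mine) of the fixed length-code table from RFC 1951, section 3.2.5. It shows both the 284/285 overlap the question asks about and how naturally (length - 2) fits in a byte:

    # Base match length and number of extra bits for length symbols 257..285.
    LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
                   35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
    LENGTH_EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]

    def max_length(symbol):
        """Largest match length representable by length symbol 257..285."""
        i = symbol - 257
        return LENGTH_BASE[i] + (1 << LENGTH_EXTRA[i]) - 1

    print(max_length(284))   # 227 + 31 = 258, the same as...
    print(max_length(285))   # 258, which is the duality in question

    # Symbols 257..264: the low byte of the symbol is exactly (match length - 2).
    for sym in range(257, 265):
        assert (sym & 0xFF) == max_length(sym) - 2

    # Symbol 285: (match length - 2) is 256, which wraps to 0 in a single byte.
    print((max_length(285) - 2) & 0xFF)   # 0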
zlib6502 uses this to considerable advantage. It calculates the match length less 2 and keeps it in inflateCodes_lengthMinus2. And once the back pointer into the window has been determined, it copies the data like so:
        jsr copyByte
        jsr copyByte
inflateCodes_copyByte
        jsr copyByte
        dec inflateCodes_lengthMinus2
        bne inflateCodes_copyByte
It makes two explicit calls to copy a byte and then loops over the length less 2, which works as you would expect when that counter is 1 to 255. When the counter is 0 it actually iterates 256 times, as we desire: the first time through the loop, the 0 is decremented to 255, which is non-zero, so the loop continues for 255 more iterations, a total of 256.
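If you want to convince yourself of that wrap-around behaviour without a 6502 handy, here is a rough Python model of the loop (the function name is mine):

    def copies_made(length_minus_2):
        """Model the copy loop above with an explicit 8-bit counter."""
        copies = 2                           # the two unconditional jsr copyByte calls
        counter = length_minus_2 & 0xFF
        while True:
            copies += 1                      # the copy inside the loop
            counter = (counter - 1) & 0xFF   # dec wraps 0 -> 255
            if counter == 0:                 # bne falls through only at 0
                break
        return copies

    print(copies_made(1))    # 3   (the minimum match length)
    print(copies_made(255))  # 257
    print(copies_made(0))    # 258 (a stored 0 stands for 256 loop passes)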
I'd have to think that Phil Katz understood intuitively if not explicitly the benefits of keeping the length of matches within 8 bits.

Related

MySQL character set for numbers compression

I would like to store many numbers in 1 cell and save space. The idea is to encode each one to a string of a constant length and store them in a text field (presumably MEDIUMTEXT). What characters can be used so that they are 1 byte only? I assume that special characters are stored in a way that uses more than 1 byte. I can use e.g. base64, but I am not sure how many encoding characters I can add to the base before MySQL actually uses more space to store them than I manage to save.
You say "numbers". What do you mean, really?
Digits? See above.
Integers? (no decimal point, no fraction)
Floats? (with exponent, etc)
Some notes on digits, compression, etc:
1 byte per ASCII character -- 8 bits
1 byte per digit, since it is an ASCII character
One random digit, at maximum theoretical compression, takes about 3.32 bits (roughly 0.42 bytes). Visualize it this way: 1000 is 3 digits and 1024 is 10 bits.
MySQL's DECIMAL datatype puts 2 digits into one byte for smaller numbers; for larger numbers it stores 9 digits in 4 bytes.
If you zip up a million digits of pi, it will come very close to the above compression.
A simple Rule of Thumb is that "text" compresses 3:1.
Base64 expands bytes by 8/6 because one 8-bit byte is represented by 64 (2^6) different characters.
Base64 is more useful for avoiding special characters; it is not really a compression technique.
A 4-byte MySQL INT (range of -2 billion to +2 billion, but usually just positive and not evenly distributed), when converted to base64 would take more than 5 bytes for 9-10 digits.
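Two of those notes, the per-digit information content and the Base64 expansion, are easy to check for yourself; a quick Python sketch (the sample value is just an illustration):

    import base64, math, struct

    # A random decimal digit carries log2(10) bits of information.
    print(math.log2(10))                          # ~3.32 bits, roughly 0.42 bytes per digit

    # Base64 expands data by 4/3: every 3 input bytes become 4 output characters.
    raw = struct.pack('>i', 2000000000)           # a 4-byte signed INT value
    encoded = base64.b64encode(raw).rstrip(b'=')  # without padding
    print(len(raw), len(encoded))                 # 4 bytes -> 6 characters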
General techniques
Client compression: For 123,2345,88,22, here is one way to handle it. In fact, I recommend this for virtually any text handling where compression is desired in MySQL.
use compress() (or similar function) in your client.
use BLOB (up to 64KB) or MEDIUMBLOB (up to 16MB) in the table
use uncompress() after retrieving the blob.
For an array of numbers, use json_encode on the array, then feed it into compress+blob, above. It will work for any sized "numbers" and provide nearly maximal compression.
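A minimal client-side sketch of that recipe in Python, assuming a hypothetical table samples(id INT, payload MEDIUMBLOB) and the mysql-connector-python driver (the table, column, and connection details are my assumptions, not part of the answer):

    import json, zlib
    import mysql.connector

    numbers = [123, 2345, 88, 22]
    blob = zlib.compress(json.dumps(numbers).encode())   # compress() done in the client

    conn = mysql.connector.connect(user='app', database='test')
    cur = conn.cursor()
    cur.execute("INSERT INTO samples (id, payload) VALUES (%s, %s)", (1, blob))
    conn.commit()

    cur.execute("SELECT payload FROM samples WHERE id = %s", (1,))
    restored = json.loads(zlib.decompress(cur.fetchone()[0]))   # uncompress() in the client
    assert restored == numbers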
You cannot efficiently reach into a MEDIUMTEXT or BLOB to get one number out of an array. The entire cell will be fetched.
That leads to another general statement: if you have a lot of stuff that you don't need to sort on or fetch individually, JSON is a good approach. From MySQL's point of view, think of it as an opaque blob. The application writes and rereads it as one big thing, then picks it apart.
The JSON will possibly encode the above example as ["123","2345","88","22"], which will be slightly fatter after compression. But, any good compression algorithm will notice and take advantage of the repetition.
Take advantage of the data
17,22738 48,77795 300
17,22792 48,77795 297
17,22853 48,77764 294
17,22874 48,77743 297
17,22887 48,77704 300
17,22968 48,77671 305
17,23069 48,77563 296
17,23092 48,77561 292
-->
17,22738 48,77795 300
54 0 -3
61 -31 -3
21 -21 3
13 -39 3
81 -33 5
101 -108 -9
23 -2 -4
The numbers stay relatively constant. Take advantage of it by starting with raw data, but then switching to deltas. Try it with about 10 times as much data; I suspect you will continue to get better than 2x compression before zipping, but maybe slightly less than 2x after zipping. (Zipping can take advantage of the repetition of 48,777; I am taking more advantage of it by tossing most of it.)
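Here is a rough sketch of that delta transform in Python, using the eight sample rows above (parsed as plain integers, with the decimal commas dropped):

    import zlib

    rows = [
        (1722738, 4877795, 300), (1722792, 4877795, 297),
        (1722853, 4877764, 294), (1722874, 4877743, 297),
        (1722887, 4877704, 300), (1722968, 4877671, 305),
        (1723069, 4877563, 296), (1723092, 4877561, 292),
    ]

    def delta_encode(rows):
        """Keep the first row raw; every later row becomes column-wise differences."""
        out = [rows[0]]
        for prev, cur in zip(rows, rows[1:]):
            out.append(tuple(c - p for p, c in zip(prev, cur)))
        return out

    raw_text   = "\n".join(" ".join(map(str, r)) for r in rows).encode()
    delta_text = "\n".join(" ".join(map(str, r)) for r in delta_encode(rows)).encode()

    print(len(raw_text), len(delta_text))                                 # text sizes before zipping
    print(len(zlib.compress(raw_text)), len(zlib.compress(delta_text)))   # and after zipping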

About MIPS lb and sb and endianness

I just read a comment by @Cheshar on this answer - Loading and storing bytes in MIPS.
This is my reasoning regarding his first point: the value in $t0 should be 0xFFFFFF90 (i.e. it's sign-extended), but this won't change the result of mem(4) (I take this to mean reading the word starting at address 0x04), which is still 0xFFFF90FF. Am I correct?
But I'm not sure about his second point:
["] lb and sb doesn't care for endianness. [."]
I'm wondering why. The change from big endian to little endian is:
byte:  0  1  2  3  ----\   3  2  1  0
      00 90 12 A0  ----/  00 90 12 A0
so it seems like each individual byte is still read as big endian?
lb/sb do not care about endianness. There is no endianness for a single byte.
It only matters if you store a big/little endian (e.g. 4-byte) number and then try to access it byte by byte.
The byte offsets do not change, so a better diagram might be:
byte:  0  1  2  3  ----\   0  1  2  3
      00 90 12 A0  ----/  A0 12 90 00
If $t1 points to your stored integer, when you do:
lb $t0,1($t1)
You get 12 for little endian and 90 for big endian.
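You can see the same thing without a MIPS machine; in Python, struct packs the same 32-bit value both ways, and indexing a byte plays the role of lb:

    import struct

    value = 0x009012A0                 # the word from the diagram above

    little = struct.pack('<I', value)  # memory image on a little-endian machine
    big    = struct.pack('>I', value)  # memory image on a big-endian machine

    print(little.hex())                # 'a0129000'
    print(big.hex())                   # '009012a0'

    # lb $t0,1($t1): load the byte at offset 1 from the start of the word
    print(hex(little[1]))              # 0x12 on little endian
    print(hex(big[1]))                 # 0x90 on big endian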
UPDATE:
I upvoted your answer since it's clean. But didn't you find this counterintuitive at first? In little endian, the 32-bit integer seems to have no meaning when its bytes are read all together, either left to right or right to left...?
Once the data is in a register (via lw), we visualize and operate on it as big endian (i.e. a left shift by 1 is a multiply by 2).
The decimal value 123 is "big endian" (one hundred + twenty + three).
Little endian is merely the byte order when we fetch from or store into memory. The hardware will shuffle the bytes as needed.
The advantage of little endian is that it works better for large multiprecision numbers (e.g. libgmp).
And, when Intel first came out with the 8 bit 8080 processor (with only a single byte memory bus), little endian made things faster. For example, when doing an add, after fetching the LSB at offset 0, it could do the add of the two LSB bytes in parallel with the fetch of MSB at offset 1.
To give an example: 8-bit (unsigned) integer b00100001 is 33(decimal), but with little endian it is stored as b00010010, which is 18(decimal) when read from left to right, and b01001000, which is 64+8=72(decimal) when read from right to left, bit by bit.
While it is possible for a [theoretical] computer architecture to behave as you describe, no modern one [that I'm aware of] does. That's partly because doing it requires more complex circuitry.
However, I once wrote a multiprecision math package that did use little endian bytes and little endian bits within the bytes. But it was slow. This is sometimes useful for large bit vectors (e.g. 500,000 bits wide).
Or my idea is completely wrong since computer can only see byte as an abstraction of underlying bits.
The endianness of bits in a byte is the same (big endian), regardless of whether the byte is in a register or in a memory cell.
The different endianness only pertains to multibyte integers (e.g. in C, int or short).

Minimum register length required to store values between -64 (hex) and 128 (hex)?

What is the minimum register length in a processor required to store values between -64 (hex) and 128 (hex), assuming 2's complement format?
I was thinking an 8 bit register since a 2's complement of 8 bit register goes from 0 to 255.
Am I correct?
Probably you've used the wrong term. 0x64 and 0x128 are very rarely used as hex values. And if you do mean those values, then obviously you can't store that big a range in 8 bits: 0x128 - (-0x64) = 0x18C, which needs at least 9 bits to store.
OTOH 64 and 128 are extremely common values, because they're powers of 2. Using the common 2's complement encoding would also cost you 9 bits (because 128 is outside an 8-bit two's complement range) and waste a lot of unused values. But in fact there are almost no 9-bit systems, so you'll have to use 16-bit shorts. Hence if you want to save memory, the only way is to use your own encoding.
If you only need the value for storage, almost any encoding is appropriate. For example, use int8_t with -64 to 127 as normal and a special case for 128 (-128, -65... any number you prefer), or use a uint8_t from 0 to 192 and map the values linearly. Just convert to and from the real value on load/store. Operations still need to be done in a type wider than 8 bits, but the on-disk size is only 8 bits.
If you need the value for calculation, more care should be taken. For example, you could use excess-64 encoding, in which binary 0 represents -64, 192 represents 128, and generally A is represented by A + 64. After each calculation you'll have to readjust the value to keep the representation correct. For example, if A and B are stored as a and b, which are A + 64 and B + 64 respectively, then A + B must be computed as a + b - 64 (as the bias has been added one time more than needed).
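A small sketch of that excess-64 idea in Python (the helper names are mine):

    BIAS = 64

    def encode(value):
        """Map -64..128 onto 0..192 so it fits in an unsigned byte."""
        assert -64 <= value <= 128
        return value + BIAS

    def decode(stored):
        return stored - BIAS

    def add_encoded(a, b):
        # a + b carries the bias twice, so subtract it once.
        return a + b - BIAS

    A, B = -10, 100
    a, b = encode(A), encode(B)
    assert decode(add_encoded(a, b)) == A + B   # 90
    print(encode(-64), encode(128))             # 0 192, the two ends of the range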
You would not be correct even if the maximum were only 128 (decimal). Since you're using 2's complement, the range is actually from -(2^(N-1)) to +(2^(N-1) - 1), where N is the number of bits. So 8 bits would have a range of -128 to 127 (decimal).
Since you present it as actually -64 (hex) to 128 (hex), you are really looking at -100 (decimal) to 296 (decimal). Adding a bit increases the range to -256 to 255, and one more addition gets you to -512 to 511, making the necessary register length 10 bits.
Now make sure that you were not dealing with -64 to 128 (decimal). As I pointed out earlier, the 8-bit range only goes up to 127, which would make it a very tricky question if you were not on your toes. In that case it would be 9 bits.
In two's complement, an 8-bit register will range from -128 to +127. To get the upper bound, you fill the lower 7 bits with 1s: 01111111 is 127 in decimal. To get the lower bound, you set the highest bit to 1 and the rest to 0: 10000000 is -128 in two's complement.
Those hex values seem a bit odd (they're powers of two in decimal), but in any case: 0x128 (the 0x is a standard prefix for hex numbers) is the larger of the numbers in magnitude, and its binary representation is 100101000. You need to be able to represent those nine bits after the sign bit. So to be able to use two's complement, you'd need at least ten bits.
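A quick way to check that count (Python):

    lo, hi = -0x64, 0x128            # -100 .. 296 in decimal

    def bits_needed_twos_complement(lo, hi):
        """Smallest N such that -(2**(N-1)) <= lo and hi <= 2**(N-1) - 1."""
        n = 1
        while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
            n += 1
        return n

    print(bits_needed_twos_complement(lo, hi))   # 10
    print(bin(0x128))                            # 0b100101000, nine magnitude bits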

How to represent the alphabet with binary symbols

I was reading an article on binary numbers and it had some practice problems at the end, but it didn't give the solutions to the problems. The last one is "How many bits are required to represent the alphabet?". Can you tell me the answer to that question and briefly explain why?
Thanks.
You would only need 5 bits because you are counting to 26 (if we take only upper or lowercase letters). 5 bits will count up to 31, so you've actually got more space than you need. You can't use 4 because that only counts to 15.
If you want both upper and lowercase then 6 bits is your answer - 6 bits will happily count to 63, while your double alphabet has 2 * 26 = 52 characters, again leaving plenty of headroom.
It depends on your definition of alphabet. If you want to represent one character from the 26-letter Roman alphabet (A-Z), then you need log2(26) = 4.7 bits. Obviously, in practice, you'll need 5 bits.
However, given an infinite stream of characters, you could theoretically come up with an encoding scheme that got close to 4.7 bits (there just won't be a one-to-one mapping between individual characters and bit vectors any more).
If you're talking about representing actual human language, then you can get away with a far lower number than this (in the region of 1.5 bits/character), due to redundancy. But that's too complicated to get into in a single post here... (Google keywords are "entropy", and "information content").
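For reference, the arithmetic behind those figures (Python):

    import math

    print(math.log2(26))              # ~4.70 bits per letter, so 5 bits in practice
    print(math.ceil(math.log2(26)))   # 5  (one case only)
    print(math.ceil(math.log2(52)))   # 6  (both upper and lower case, 52 letters)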
There are 26 letters in the alphabet, so 5 bits (2^5 = 32) is the minimum word length that can represent all the letters.
How direct does the representation need to be? If you need 1:1 with no translation layer, then 5 bits will do. But if a translation layer is an option, then you can get away with less. Morse code, for example, can do it in 3 bits. :)

Are "65k" and "65KB" the same?

Are "65k" and "65KB" the same?
From xkcd:
65KB normally means 66,560 bytes. 65k means 65,000, and says nothing about what it is 65,000 of. If someone says 65k bytes, they might mean 65KB... but they're misspeaking if so. Some people argue that you should write 65 KiB when you mean 66,560 bytes, since k means 1000 in the metric system. Everyone ignores them, though.
Note: a lowercase b would mean bits rather than bytes. 8Kb = 1KB. When talking about transmission rates, bits are usually used.
Edit: As Joel mentions, hard drive manufacturers often treat the K as meaning 1000, so hard disk space of 65KB would often mean 65,000 bytes. Thumb drives and the like tend to use K to mean 1024, though.
Probably.
Technically 65k just means 65 thousand (monkeys perhaps?). You would have to take into account the context.
65kB can be interpreted to mean either 65 * 1000 = 65,000 bytes or 65 * 2^10 = 66,560 bytes.
You can read about all this and kibibytes at Wikipedia.
65k is 65,000 of something
65KB is 66,560 bytes (65*1024)
Like most have said, 65KB is 66560, 65k is 65000. 65KB means 66560 BYTES, and 65k is ambiguous. So they're not the same.
Additionally, since there are a few people equating "8 bits = 1 byte", I thought I'd add a little bit about that.
Transmission rates are usually in bits per second, because the grouping into bytes might not be directly related to the actual transmission clock rate.
Take for instance 9600 baud with RS232 serial ports. There are always exactly 9600 bits going out per second (+/- maybe a 5% clock tolerance). However, if those bits are grouped as N-8-1, meaning "no parity, 8 bits, 1 stop bit", then there are 10 bits per byte and so the byte rate is 960 bytes/second maximum. However, if you have something odd like E-8-2, or "even parity, 8 bits, 2 stop bits" then it's 12 bits per byte, or 800 bytes/second. The actual bits are going out at exactly the same rate, so it only makes sense to talk about the bits/second rate.
So 1 byte might be 8 bits, 9 bits (i.e. with parity), 10 bits (i.e. N81, E71, N72), 11 bits (i.e. E81), 12 bits (i.e. E82), or whatever. There are lots of combinations of ways, with just RS232-style transmission, to get very odd byte rates. If you throw in Reed-Solomon or ECC correction, you could have even more bits per byte. Then there's 8b/10b, 6b/8b, Hamming codes, etc...
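A tiny sketch of that framing arithmetic in Python (the function is mine; it just spells out the answer's numbers):

    def bytes_per_second(baud, data_bits, parity_bits, stop_bits, start_bits=1):
        """Bits on the wire per byte = start + data + parity + stop."""
        bits_per_byte = start_bits + data_bits + parity_bits + stop_bits
        return baud / bits_per_byte

    print(bytes_per_second(9600, 8, 0, 1))   # N-8-1 -> 960.0 bytes/second
    print(bytes_per_second(9600, 8, 1, 2))   # E-8-2 -> 800.0 bytes/second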
In terms of data transfer rates, 65k implies 65 kilobits and 65KB implies 65 kilobytes.
Check this http://en.wikipedia.org/wiki/Data_rate_units
cheers
From Wikipedia for Kilobyte:
It is abbreviated in a number of ways: KB, kB, K and Kbyte.
In other words, they could both be abbreviations for Kilobyte. However, using only a lowercase 'k' is not a standard abbreviation, but most people will know what you mean.
There you go:
kB = kiloByte
KB = KelvinByte
kb = kilobit
Kb = Kelvinbit
Use the lowercase-k ones (kB and kb)! But be aware that some people use 1024 instead of 1000 for k (kilo).
My opinion on this: kilo = 1000, so whoever first decided to use 1024 made the mistake. If I am not mistaken, 1024 was used first by IT engineers. Later, someone (probably some marketing genius) found out that they could label things using 1000 as kilo and make them look bigger than they actually are. Since then, you can't be sure which value is used for kilo.
In general, yes, they're both 65 kilobytes (66,560 bytes).
Sometimes the abbreviations are tricky with casing. If it had been "65Kb", it would have correctly meant kilobits.
A kilobyte (KB) is 1024 bytes.
Kilo stands for 1000.
So, going purely by notation: (65k = 65,000) != (65KB = 66,560).
However, if you're talking about memory you're probably always going to see KB (even if it's written as k).
Generally, KB = k. It's all very confusing really.
Strictly speaking, the former is not specifying the unit: 65,000 what? So the two can't really be compared.
However, in general speech most people use 65K (note it's normally uppercase) to mean 65 kilobytes (i.e. 65 * 1024 bytes).
Note that 65Kb usually denotes kilobits.
"Officially", 65k is 65,000; however people say 65k all the time, even if the real number is something like 65,123.
Typically 65k means anywhere from 64.00001 to 65.99999998 KiB, or sometimes anywhere between 63500 and 64999 bytes ... i.e., we aren't all that precise most of the time with sizes of things. When someone cares to be precise, they will be explicit, or the meaning will be clear from context.
65 KiB means 65 * 1024 bytes. .... unless the person was rounding. Never trust a number unless you measure it yourself! ... :)
Hope that helps,
--- Dave
65k may be the same as 65KB, but remember, 65KB is larger than 65Kb.
Case is important, as are units.
Psto, you're right. This is an absolute minefield!
As many have said, K is technically kilo, meaning thousand (of anything), and comes from Greek.
But you can assume different units depending on the context.
As data transfer rates are most often measured in bits, K in this context can be assumed to mean kilobits.
When talking about data storage, a file's size, etc., K can be assumed to mean kilobytes.