Bit numbering in binary data with little endianness

More formally, the 16 bytes of plaintext p0, ..., p15 are first split into 4 words P0, ..., P3 of 32 bits each using the little-endian convention.
In the above sentence, take the example plaintext this_is__awesome, which contains 16 characters. I wonder which end is p0, and after splitting the plaintext into the 4 words this, _is_, _awe, some, which of these is P0?
Considering the term 'little endianness', 0x1234 would be stored as 0x34 0x12. So would this become [some, _awe, _is_, this] if we apply little endianness to the words, or would it become [emos, ewa_, _si_, siht] if we apply little endianness at the byte level?
And finally, what would Pi be for the plaintext this_is__awesome?
Please help me out with this.
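As a concrete illustration of the little-endian convention the quoted sentence describes, here is a minimal Python sketch. The plaintext is the one from the question; treating p0 as the least significant byte of P0 is the usual little-endian reading, not something taken from any particular cipher specification.

    import struct

    plaintext = b"this_is__awesome"       # p0 = 't' (the first byte), ..., p15 = 'e' (the last)

    # Little-endian convention: P0 = p0 | p1<<8 | p2<<16 | p3<<24, and so on for P1..P3.
    P = struct.unpack("<4I", plaintext)   # "<" = little-endian, "I" = unsigned 32-bit word

    for i, word in enumerate(P):
        print(f"P{i} = 0x{word:08X}  (from bytes {plaintext[4*i:4*i + 4]!r})")

    # P0 = 0x73696874: the word built from 't','h','i','s', with 't' in the low-order byte.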

MySQL character set for numbers compression

I would like to store many numbers in 1 cell and save space. The idea is to encode each one to a string of a constant length and store them in a text field (presumably MEDIUMTEXT). What characters can be used so that they are 1 byte only? I assume that special characters are stored in a way that uses more than 1 byte. I can use e.g. base64 but I am not sure how many encoding characters I can add to the base before MySQL uses actually more space to store them than I manage to save.
You say "numbers". What do you mean, really?
Digits? See above.
Integers? (no decimal point, no fraction)
Floats? (with exponent, etc)
Some notes on digits, compression, etc:
1 byte per ASCII character -- 8 bits
1 byte per digit, since it is an ASCII character
One random digit, at maximum theoretical compression, takes about 3.32 bits (roughly 0.42 bytes). Visualize it this way: 1000 is 3 digits and 1024 is 10 bits.
MySQL's DECIMAL datatype puts 2 digits into one byte for smaller numbers; for larger numbers it stores 9 digits in 4 bytes.
If you zip up a million digits of pi, it will come very close to the above compression.
A simple Rule of Thumb is that "text" compresses 3:1.
Base64 expands bytes by 8/6 because each base64 character encodes only 6 bits (64 = 2^6 different characters), so 3 bytes become 4 characters.
Base64 is more useful for avoiding special characters; it is not really a compression technique.
A 4-byte MySQL INT (range of -2 billion to +2 billion, but usually just positive and not evenly distributed) holds 9-10 decimal digits; converted to base64 it would take more than 5 bytes (4 x 8/6 is about 5.3, so 6 characters before any padding).
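Two of the numbers above, worked out in a quick Python sketch (my own illustration, nothing MySQL-specific): the information content of a random decimal digit, and the base64 size of a 4-byte integer.

    import base64
    import math
    import struct

    # One random decimal digit carries log2(10) bits of information.
    bits_per_digit = math.log2(10)
    print(f"{bits_per_digit:.2f} bits (~{bits_per_digit / 8:.2f} bytes) per random digit")

    # Base64 size of a 4-byte INT: 4 bytes * 8/6 = 5.33 characters,
    # which rounds up to 6, and to 8 with the trailing "==" padding.
    raw = struct.pack(">I", 2_000_000_000)     # a 10-digit value stored in 4 bytes
    encoded = base64.b64encode(raw)
    print(len(raw), "raw bytes ->", len(encoded), "base64 characters:", encoded)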
General techniques
Client compression: For 123,2345,88,22, here is one way to handle it. In fact, I recommend this for virtually any text handling where compression is desired in MySQL.
use compress() (or similar function) in your client.
use BLOB (up to 64KB) or MEDIUMBLOB (up to 16MB) in the table
use uncompress() after retrieving the blob.
For an array of numbers, use json_encode on the array, then feed it into compress+blob, above. It will work for any sized "numbers" and provide nearly maximal compression.
You cannot efficiently reach into a MEDIUMTEXT or BLOB to get one number out of an array. The entire cell will be fetched.
That leads to another general statement... If you have a lot of stuff that you don't need to sort on, nor fetch individually, JSON is a good approach. Think of it, from MySQL's point of view, as an opaque blob. The application writes and rereads it as one big thing, then picks it apart.
The JSON will possibly encode the above example as ["123","2345","88","22"], which will be slightly fatter after compression. But, any good compression algorithm will notice and take advantage of the repetition.
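A minimal client-side sketch of the compress-then-store flow described above, using Python's json and zlib in place of a client-library compress() call; the table and column names in the commented SQL are hypothetical.

    import json
    import zlib

    numbers = [123, 2345, 88, 22]

    # Before INSERT: serialize the array, then compress it on the client.
    blob = zlib.compress(json.dumps(numbers).encode("utf-8"))
    # e.g. cursor.execute("INSERT INTO t (payload) VALUES (%s)", (blob,))   -- hypothetical

    # After SELECT: decompress the BLOB, then parse the JSON back into an array.
    restored = json.loads(zlib.decompress(blob).decode("utf-8"))
    assert restored == numbers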
Take advantage of the data
17,22738 48,77795 300
17,22792 48,77795 297
17,22853 48,77764 294
17,22874 48,77743 297
17,22887 48,77704 300
17,22968 48,77671 305
17,23069 48,77563 296
17,23092 48,77561 292
-->
17,22738 48,77795 300
54 0 -3
61 -31 -3
21 -21 3
13 -39 3
81 -33 5
101 -108 -9
23 -2 -4
The numbers stay relatively constant. Take advantage of it by starting with raw data, but then switching to deltas. Try it with about 10 times as much data; I suspect you will continue to get better than 2x compression before zipping, but maybe slightly less than 2x after zipping. (Zipping can take advantage of the repetition of 48,777; I am taking more advantage of it by tossing most of it.)
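Here is a rough sketch of that delta transformation, assuming each row is a (latitude, longitude, elevation) reading with the coordinates scaled to integers (the scaling and names are my assumption, not from the original data):

    # First four readings from the sample above, with coordinates scaled by 1e5.
    readings = [
        (1722738, 4877795, 300),
        (1722792, 4877795, 297),
        (1722853, 4877764, 294),
        (1722874, 4877743, 297),
    ]

    def to_deltas(rows):
        """Keep the first row as-is; replace each later row with its difference
        from the previous row, giving much smaller, more repetitive numbers."""
        out = [rows[0]]
        for prev, cur in zip(rows, rows[1:]):
            out.append(tuple(c - p for p, c in zip(prev, cur)))
        return out

    def from_deltas(rows):
        """Inverse transform: running sums rebuild the original readings."""
        out = [rows[0]]
        for delta in rows[1:]:
            out.append(tuple(a + b for a, b in zip(out[-1], delta)))
        return out

    deltas = to_deltas(readings)    # [(1722738, 4877795, 300), (54, 0, -3), (61, -31, -3), (21, -21, 3)]
    assert from_deltas(deltas) == readings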

About MIPS lb and sb and endianness

I just read a comment by #Cheshar in this answer - Loading and storing bytes in MIPS.
This is my reasoning regarding his first point: the value in $t0 should be 0xFFFFFF90 (i.e. it's sign-extended), but this won't change the result of mem(4) (I think this means reading the word starting at 0x04), which is still FFFF90FF. Am I correct?
But I'm not sure about his second point:
["] lb and sb doesn't care for endianness. [."]
I'm thinking about why. The change from big endian to little endian is:
byte: 0 1 2 3 ----\ 3 2 1 0
00 90 12 A0 ----/ 00 90 12 A0
so it seems like an individual byte is still read as if it were big endian?
lb/sb do not care about endianness. There is no endianness for a single byte.
It only matters if you store a big/little endian [(e.g.) 4 byte] number and then try to access it byte-by-byte.
The byte offsets do not change, so a better diagram might be:
byte: 0 1 2 3 ----\ 0 1 2 3
00 90 12 A0 ----/ A0 12 90 00
If $t1 points to your stored integer, when you do:
lb $t0,1($t1)
You get 90 for little endian and 12 for big endian.
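The same observation in a short Python sketch (not MIPS, just the byte layout that the lb offset sees). The 32-bit value 0xA0129000 is my assumption, chosen so that reading offset 1 matches the statement above:

    value = 0xA0129000

    little = value.to_bytes(4, "little")   # memory layout on a little-endian machine
    big = value.to_bytes(4, "big")         # memory layout on a big-endian machine

    print([hex(b) for b in little])        # ['0x0', '0x90', '0x12', '0xa0']
    print([hex(b) for b in big])           # ['0xa0', '0x12', '0x90', '0x0']

    # The equivalent of "lb $t0,1($t1)": read the byte at offset 1.
    print(hex(little[1]))                  # 0x90 on little endian
    print(hex(big[1]))                     # 0x12 on big endian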
UPDATE:
I upvoted your answer since it's clean. But didn't you find that counterintuitive at first? Since in little endian the 32-bit integer has no meaning when read as 32 bits all together, either from left to right or from right to left...?
Once the data is in a register (via lw), we visualize and operate on it as big endian (i.e. a left shift by 1 is a multiply by 2).
The decimal value 123 is "big endian" (one hundred + twenty + three).
Little endian is merely the byte order when we fetch from or store into memory. The hardware will shuffle the bytes as needed.
The advantage of little endian is that it works better for large multiprecision numbers (e.g. libgmp).
And, when Intel first came out with the 8 bit 8080 processor (with only a single byte memory bus), little endian made things faster. For example, when doing an add, after fetching the LSB at offset 0, it could do the add of the two LSB bytes in parallel with the fetch of MSB at offset 1.
To give an example: 8-bit (unsigned) integer b00100001 is 33(decimal), but with little endian it is stored as b00010010, which is 18(decimal) when read from left to right, and b01001000, which is 64+8=72(decimal) when read from right to left, bit by bit.
While it is possible for a [theoretical] computer architecture to behave as you describe, no modern one [that I'm aware of] does. That's partly because doing it requires more complex circuitry.
However, I once wrote a multiprecision math package that did use little endian bytes and little endian bits within the bytes. But it was slow. This is sometimes useful for large bit vectors (e.g. 500,000 bits wide).
Or is my idea completely wrong, since the computer can only see a byte as an abstraction of the underlying bits?
The endianness of bits in a byte is the same (big endian), regardless of whether the byte is in a register or in a memory cell.
The different endianness only pertains to multibyte integers (e.g. in C, int or short).

What's the exact meaning of the statement "Since ASCII used 7 bits for the character, it could only represent 128 different characters"?

I come across the below statement while studying about HTML Character Sets and Character Encoding :
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
When we convert any decimal value from the ASCII character set to its binary equivalent, it comes down to a 7-bit-long binary number.
E.g. for the capital English letter 'E', the decimal value 69 exists in the ASCII table. If we convert 69 to its binary equivalent, it comes down to the 7-bit-long binary number 1000101.
Then why, in the ASCII table, is it mentioned as an 8-bit-long binary number 01000101 instead of a 7-bit-long binary number 1000101?
This is contradictory to the statement
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
The above statement is saying that ASCII used 7 bits for the character.
Please clear my confusion about considering the binary equivalent of a decimal value. Should I consider the 7-bit-long binary equivalent or the 8-bit-long binary equivalent of any decimal value from the ASCII table? Please explain it to me in easy to understand language.
Again, consider the below statement :
Since ASCII used 7 bits for the character, it could only represent 128
different characters.
According to the above statement, how does the number of characters (128) that ASCII supports relate to the fact that ASCII uses 7 bits to represent any character?
Please clear the confusion.
Thank You.
In most processors, memory is byte-addressable and not bit-addressable. That is, a memory address gives the location of an 8-bit value. So, almost all data is manipulated in multiples of 8 bits at a time.
If we were to store a value that has by its nature only 7 bits, we would very often use one byte per value. If the data is a sequence of such values, as text might be, we would still use one byte per value to make counting, sizing, indexing and iterating easier.
When we describe the value of a byte, we often show all of its bits, either in binary or hexadecimal. If a value is some sort of integer (say of 1, 2, 4, or 8 bytes) and its decimal representation would be more understandable, we would write the decimal digits for the whole integer. But in those cases, we might lose the concept of how many bytes it is.
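A small sketch of the arithmetic in the quoted statement, and of a 7-bit value occupying one 8-bit byte (plain Python, just for illustration):

    # 7 bits can distinguish 2**7 different values, hence 128 possible ASCII characters.
    print(2 ** 7)                      # 128

    code = ord("E")                    # 69
    print(format(code, "b"))           # 1000101  -- the 7 significant bits
    print(format(code, "08b"))         # 01000101 -- the same value shown as one 8-bit byte
    print(len("E".encode("ascii")))    # 1        -- and it is stored as one whole byte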
BTW—HTML doesn't have anything to do with ASCII. And, Extended ASCII isn't one encoding. The fundamental rule of character encodings is to read (decode) with the encoding the text was written (encoded) with. So, a communication consists of the transferring of bytes and a shared understanding of the character encoding. (That makes saying "Extended ASCII" so inadequate as to be nearly useless.)
An HTML document represents a sequence of Unicode characters. So, one of the Unicode character encodings (UTF-8) is the most common encoding for an HTML document. Regardless, after it is read, the result is Unicode. An HTML document could be encoded in ASCII but why do that? If you did know it was ASCII, you could just as easily know that it's UTF-8, because every ASCII-encoded document is also a valid UTF-8 document.
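A tiny sketch of that subset relationship:

    text = "Hello, HTML!"

    # For characters in the ASCII range, ASCII and UTF-8 produce identical bytes,
    # so a pure-ASCII document reads back correctly when decoded as UTF-8.
    assert text.encode("ascii") == text.encode("utf-8")
    assert text.encode("ascii").decode("utf-8") == text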
Outside of HTML, ASCII is used billions—if not trillions—of times per second. But, unless you know exactly how it pertains to your work, forget about it, you probably aren't using ASCII.

Deflate length of 258 double encoding

In Deflate algorithm there are two ways to encode a length of 258:
Code 284 + 5 extra bits of all 1's
Code 285 + 0 extra bits;
At first glance, this is not optimal, because proper use of code 285 would allow a length of 259 to be encoded.
Is this duality a specification mistake that was never fixed for compatibility reasons, or are there arguments for it - for example, must a length of 258 be encodable with the shorter code (0 extra bits) for some reason?
We may never know. The developer of the deflate format, Phil Katz, passed away many years ago at a young age.
My theory is that a match length was limited to 258 so that a match length in the range 3..258 could fit in a byte, encoded as 0..255. This format was developed around 1990, when this might make a difference in an assembler implementation.
Adding a second answer here to underscore Mark's guess that allowing the length to be encoded in a byte is helpful to assembler implementations. At the time, 8086-level assembler was still common, and using the 8-bit form of the registers gave you more of them to work with than using them at their 16-bit size.
The benefit is even more pronounced on 8 bit processors such as the 6502. It starts with the length decoding. Symbols 257 .. 264 represent a match length of 3 .. 10 respectively. If you take the low byte of those symbols (1 .. 8) you get exactly 2 less than the match length.
A more complicated yet fairly easy to compute formula gives 2 less than the match length of symbols 265 through 284. 2 less than the match length of symbol 285 is 256. That doesn't fit in a byte but we can store 0 which turns out to be equivalent.
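A quick sketch that checks both the double encoding from the question and the low-byte trick described above, using the length-code table from RFC 1951 (the table constants and the function name here are mine, not from zlib6502):

    # (base_length, extra_bits) for deflate length symbols 257..285, per RFC 1951.
    BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
            35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
    EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
             3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]

    def match_length(symbol, extra_value=0):
        """Decode a match length from a length symbol plus the value of its extra bits."""
        i = symbol - 257
        assert 0 <= extra_value < (1 << EXTRA[i])
        return BASE[i] + extra_value

    # Symbol 284 with all five extra bits set decodes to 227 + 31 = 258 ...
    assert match_length(284, 31) == 258
    # ... the same length that symbol 285 encodes with no extra bits.
    assert match_length(285) == 258

    # Low-byte trick: for symbols 257..264 the low byte of the symbol is (length - 2);
    # for symbol 285, length - 2 = 256, which stored in a byte becomes 0.
    for sym in range(257, 265):
        assert sym & 0xFF == match_length(sym) - 2
    assert (match_length(285) - 2) & 0xFF == 0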
zlib6502 uses this for considerable advantage. It calculates the match length in inflateCodes_lengthMinus2. And once the back pointer into the window has been determined it copies the data like so:
    jsr copyByte                     ; copy the two bytes not counted in lengthMinus2
    jsr copyByte
inflateCodes_copyByte
    jsr copyByte                     ; copy one byte of the match
    dec inflateCodes_lengthMinus2    ; decrement the remaining count (0 wraps to 255)
    bne inflateCodes_copyByte        ; loop until the count reaches 0
It makes two explicit calls to copy a byte and then loops over the length less 2, which works as you would expect for stored counts of 1 to 255. For a stored 0 it will actually iterate 256 times, as we desire: the first time through the loop the 0 is decremented to 255, which is non-zero, so the loop continues 255 more times for a total of 256.
I'd have to think that Phil Katz understood intuitively if not explicitly the benefits of keeping the length of matches within 8 bits.

How to used the alphabet binary symbols

I was reading an article on binary numbers and it had some practice problems at the end, but it didn't give the solutions to the problems. The last one is "How many bits are required to represent the alphabet?". Can you tell me the answer to that question and briefly explain why?
Thanks.
You would only need 5 bits because you are counting to 26 (if we take only upper or lowercase letters). 5 bits will count up to 31, so you've actually got more space than you need. You can't use 4 because that only counts to 15.
If you want both upper and lowercase then 6 bits is your answer - 6 bits will happily count to 63, while your double alphabet has (2 * 26 = 52) characters, again leaving plenty of headroom.
It depends on your definition of alphabet. If you want to represent one character from the 26-letter Roman alphabet (A-Z), then you need log2(26) = 4.7 bits. Obviously, in practice, you'll need 5 bits.
However, given an infinite stream of characters, you could theoretically come up with an encoding scheme that got close to 4.7 bits (there just won't be a one-to-one mapping between individual characters and bit vectors any more).
If you're talking about representing actual human language, then you can get away with a far lower number than this (in the region of 1.5 bits/character), due to redundancy. But that's too complicated to get into in a single post here... (Google keywords are "entropy", and "information content").
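A quick check of the counts mentioned above:

    import math

    print(math.log2(26))              # ~4.70 bits of information in one of 26 letters
    print(math.ceil(math.log2(26)))   # 5 whole bits for a fixed-width 1:1 code
    print(math.ceil(math.log2(52)))   # 6 bits once upper and lower case (52 letters) are included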
There are 26 letters in the alphabet, so a 5-bit word (2^5 = 32 values) is the minimum word length that can represent all the letters.
How direct does the representation need to be? If you need 1:1 with no translation layer, then 5 bits will do. But if a translation layer is an option, then you can get away with less. Morse code, for example, can do it in 3 bits. :)