About MIPS lb and sb and endianness

I just read a comment by @Cheshar in this answer - Loading and storing bytes in MIPS.
This is my reasoning regarding his first point: the value in $t0 should be 0xFFFFFF90 (i.e. it's sign-extended), but this won't change the result of mem(4) (I take this to mean reading the word starting at 0x04), which is still FFFF90FF. Am I correct?
But I'm not sure about his second point:
["] lb and sb doesn't care for endianness. [."]
I'm thinking about why: the change from big endian to little endian is
byte: 0 1 2 3 ----\ 3 2 1 0
00 90 12 A0 ----/ 00 90 12 A0
so it seems like each individual byte is still read as big endian?

lb/sb do not care about endianness. There is no endianness for a single byte.
It only matters if you store a multi-byte (e.g. 4-byte) number and then try to access it byte-by-byte.
The byte offsets do not change, so a better diagram might be:
byte: 0 1 2 3 ----\ 0 1 2 3
00 90 12 A0 ----/ A0 12 90 00
If $t1 points to your stored integer, when you do:
lb $t0,1($t1)
You get 12 for little endian and 90 for big endian.
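To see the same thing outside of MIPS, here is a small C sketch (using the integer from the diagram above) that prints the stored value byte by byte; the offsets are fixed, and only the contents at each offset depend on the host's endianness:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t value = 0x009012A0;                /* the integer from the diagram */
    const uint8_t *p = (const uint8_t *)&value;

    /* Byte offsets never change; only the contents at each offset do.
       Little endian prints A0 12 90 00; big endian prints 00 90 12 A0.
       p[1] is the analog of lb $t0,1($t1): 12 on little, 90 on big. */
    for (int i = 0; i < 4; i++)
        printf("byte %d: %02X\n", i, p[i]);
    return 0;
}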
UPDATE:
I upvoted your answer since it's clean. But don't you think that's counterintuitive? Since in little endian the 32-bit integer has no meaning when read 32 bits all together, from either left to right or right to left...?
Once the data is in a register (via lw), we visualize and operate on it as big endian (i.e. a left shift by 1 is a multiply by 2).
The decimal value 123 is "big endian" (one hundred + twenty + three).
Little endian is merely the byte order when we fetch from or store into memory. The hardware will shuffle the bytes as needed.
The advantage of little endian is that it works better for large multiprecision numbers (e.g. libgmp).
And, when Intel first came out with the 8 bit 8080 processor (with only a single byte memory bus), little endian made things faster. For example, when doing an add, after fetching the LSB at offset 0, it could do the add of the two LSB bytes in parallel with the fetch of MSB at offset 1.
To give an example: the 8-bit (unsigned) integer b00100001 is 33 (decimal), but with little endian it is stored as b00010010, which is 18 (decimal) when read from left to right, and b01001000, which is 64+8 = 72 (decimal) when read from right to left, bit by bit.
While it is possible for a [theoretical] computer architecture to behave as you describe, no modern one [that I'm aware of] does. That's partly because doing so requires more complex circuitry.
However, I once wrote a multiprecision math package that did use little endian bytes and little endian bits within the bytes. But it was slow. This is sometimes useful for large bit vectors (e.g. 500,000 bits wide).
Or is my idea completely wrong, since the computer can only see a byte as an abstraction of the underlying bits?
The endianness of bits in a byte is the same (big endian), regardless of whether the byte is in a register or in a memory cell.
The different endianness only pertains to multibyte integers (e.g. in C, int or short).

MySQL character set for numbers compression

I would like to store many numbers in 1 cell to save space. The idea is to encode each one to a string of a constant length and store them in a text field (presumably MEDIUMTEXT). What characters can be used so that they are 1 byte only? I assume that special characters are stored in a way that uses more than 1 byte. I can use e.g. base64, but I am not sure how many encoding characters I can add to the base before MySQL actually uses more space to store them than I manage to save.
You say "numbers". What do you mean, really?
Digits? See above.
Integers? (no decimal point, no fraction)
Floats? (with exponent, etc)
Some notes on digits, compression, etc:
1 byte per ASCII character -- 8 bits
1 byte per digit, since it is an ASCII character
One random digit, at maximum theoretical compression, takes about 3.32 bits (roughly 0.42 bytes). Visualize it this way: 1000 is 3 digits and 1024 is 10 bits.
MySQL's DECIMAL datatype puts 2 digits into one byte for smaller numbers; for larger numbers it stores 9 digits in 4 bytes.
If you zip up a million digits of pi, it will come very close to the above compression.
A simple Rule of Thumb is that "text" compresses 3:1.
Base64 expands data by a factor of 8/6 (4/3) because each character encodes only 6 bits, i.e. one of 64 (2^6) possibilities.
Base64 is more useful for avoiding special characters; it is not really a compression technique.
A 4-byte MySQL INT (range of -2 billion to +2 billion, but usually just positive and not evenly distributed), when converted to base64 would take more than 5 bytes for 9-10 digits.
General techniques
Client compression: For 123,2345,88,22, here is one way to handle it. In fact, I recommend this for virtually any text handling where compression is desired in MySQL.
use compress() (or similar function) in your client.
use BLOB (up to 64KB) or MEDIUMBLOB (up to 16MB) in the table
use uncompress() after retrieving the blob.
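As a rough sketch of those three steps in C with zlib (the INSERT/SELECT against the BLOB column is assumed and not shown; compress()/uncompress() here are zlib's, standing in for whatever your client library offers):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *csv = "123,2345,88,22";         /* the example list */
    uLong src_len = (uLong)strlen(csv) + 1;     /* include the NUL */

    /* Step 1: compress in the client. */
    Bytef packed[128];
    uLongf packed_len = sizeof packed;
    if (compress(packed, &packed_len, (const Bytef *)csv, src_len) != Z_OK)
        return 1;

    /* Step 2: packed[0..packed_len) is what you'd store in the BLOB.
       (Inputs this tiny actually grow; the win comes on larger cells.) */

    /* Step 3: uncompress after retrieving the blob. */
    Bytef unpacked[128];
    uLongf unpacked_len = sizeof unpacked;
    if (uncompress(unpacked, &unpacked_len, packed, packed_len) != Z_OK)
        return 1;

    printf("%s (%lu -> %lu bytes)\n", (char *)unpacked,
           (unsigned long)src_len, (unsigned long)packed_len);
    return 0;
}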
For an array of numbers, use json_encode for the array, then feed it into compress+blob, above. It will work for any sized "numbers" and provide nearly maximal compression.
You cannot efficiently reach into a MEDIUMTEXT or BLOB to get one number out of an array. The entire cell will be fetched.
That leads to another general statement... If you have a lot of stuff that you don't need to sort on, nor fetch individually, JSON is a good approach. From MySQL's point of view, think of it as an opaque blob. The application writes and rereads it as one big thing, then picks it apart.
The JSON will possibly encode the above example as ["123","2345","88","22"], which will be slightly fatter after compression. But, any good compression algorithm will notice and take advantage of the repetition.
Take advantage of the data
17,22738 48,77795 300
17,22792 48,77795 297
17,22853 48,77764 294
17,22874 48,77743 297
17,22887 48,77704 300
17,22968 48,77671 305
17,23069 48,77563 296
17,23092 48,77561 292
-->
17,22738 48,77795 300
54 0 -3
61 -31 -3
21 -21 3
13 -39 3
81 -33 5
1 -108 -9
23 -2 -4
The numbers stay relatively constant. Take advantage of that by starting with raw data, but then switching to deltas. Try it with about 10 times as much data; I suspect you will continue to get better than 2x compression before zipping, but maybe slightly less than 2x after zipping. (Zipping can take advantage of the repetition of 48,777; I am taking more advantage of it by tossing most of it.)
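A sketch of that delta transform in C, assuming the readings have already been scaled to integers (the struct fields and scaling are made up for illustration):

#include <stdio.h>

struct row { long lat, lon, alt; };  /* one reading, pre-scaled to integers */

static void to_deltas(struct row *r, int n)
{
    /* Walk backwards so each delta is taken against the original,
       not an already-differenced, predecessor. */
    for (int i = n - 1; i > 0; i--) {
        r[i].lat -= r[i - 1].lat;
        r[i].lon -= r[i - 1].lon;
        r[i].alt -= r[i - 1].alt;
    }
}

int main(void)
{
    struct row rows[] = {
        {1722738, 4877795, 300},     /* 17,22738 48,77795 300 */
        {1722792, 4877795, 297},
        {1722853, 4877764, 294},
    };
    int n = (int)(sizeof rows / sizeof rows[0]);

    to_deltas(rows, n);
    for (int i = 0; i < n; i++)      /* first row stays raw */
        printf("%ld %ld %ld\n", rows[i].lat, rows[i].lon, rows[i].alt);
    return 0;
}

This prints the first row unchanged, then 54 0 -3 and 61 -31 -3, matching the table above.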

largest integer that can be stored in a double such that all integers less than it can be accurately stored as well

This is some more clarification to the question that was already answered some time ago here: biggest integer that can be stored in a double
The top answer mentions that it is "the largest integer such that it and all smaller integers can be stored in IEEE 64-bit doubles without losing precision. An IEEE 64-bit double has 52 bits of mantissa, so I think it's 2^53", because:
2^53 + 1 cannot be stored, because the 1 at the start and the 1 at the end have too many zeros in between.
Anything less than 2^53 can be stored, with 52 bits explicitly stored in the mantissa, and then the exponent in effect giving you another one.
2^53 obviously can be stored, since it's a small power of 2.
Can someone clarify the first point? What does he mean by that? Is he talking about, for example, if it were a 4-bit number, 1000 + 0001, you can't store that in 4 bits? 2^53 is just the first bit 1 and the rest 0's, right? How come you can't add a 1 to that without losing precision?
Also, "the largest integer such that it and all smaller integers can be stored in IEEE": is there some general rule such that if I wanted to find the largest n-bit integer such that it and all smaller integers can be stored in IEEE, could I simply say that it is 2^n? For example, if I were to find the largest 4-bit integer such that it and all integers below it can be represented, it would be 2^4?
Is he talking about, for example, if it were a 4-bit number, 1000 + 0001, you can't store that in 4 bits?
No, he is saying that you can't store that in 3 bits. Using the usual binary notation.
2^53 is just the first bit 1 and the rest 0's, right?
Yes, and so are 1, 2, 4, …, 2^53, 2^54, 2^55, …, 2^123, 2^124, … and also 0.125.
This is floating-point we are talking about. 2^53 is just an implicit 1 with all explicit significand bits 0, yes, but it is not the only number with this property. The crucial property is that the ULP (unit in the last place) for representing 2^53 is 2. So 2^53 can be represented, as can all powers of two that are in range, but 2^53 + 1 cannot, because the ULP is too large in that neighborhood.
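The ULP argument is easy to check in C:

#include <stdio.h>

int main(void)
{
    double big = 9007199254740992.0;   /* 2^53; the ULP here is 2 */

    printf("%.1f\n", big + 1.0);       /* 9007199254740992.0: the +1 is lost */
    printf("%.1f\n", big + 2.0);       /* 9007199254740994.0: the next double */
    printf("%d\n", big + 1.0 == big);  /* prints 1 */
    return 0;
}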
also, "The largest integer such that it and all smaller integers can be stored in IEEE". Is there some general rule such that if I wanted to find the largest n bit integer such that it and all smaller integers can be stored in IEEE, could I simply say that it is 2n?
Yes, in binary IEEE 754 floating-point, every "largest integer such that it and all smaller integers can be stored" is a power of two, specifically 2^n where n is the significand's width (counting the implicit bit).

Deflate length of 258 double encoding

In the Deflate algorithm, there are two ways to encode a length of 258:
Code 284 + 5 extra bits of all 1's
Code 285 + 0 extra bits;
On first glance, this is not optimal, because the proper use of code 285 would allow a length of 259 to be encoded.
Is this duality a specification mistake, not fixed for compatibility reasons, or are there some arguments for it - for example, that a length of 258 must be encoded with the shorter code (0 extra bits) for some reason?
We may never know. The developer of the deflate format, Phil Katz, passed away many years ago at a young age.
My theory is that a match length was limited to 258 so that a match length in the range 3..258 could fit in a byte, encoded as 0..255. This format was developed around 1990, when this might make a difference in an assembler implementation.
Adding a second answer here to underscore Mark's guess that allowing the length to be encoded in a byte is helpful to assembler implementations. At the time, 8086-level assembler was still common, and using the 8-bit form of the registers gave you more of them to work with than using them at 16-bit size.
The benefit is even more pronounced on 8 bit processors such as the 6502. It starts with the length decoding. Symbols 257 .. 264 represent a match length of 3 .. 10 respectively. If you take the low byte of those symbols (1 .. 8) you get exactly 2 less than the match length.
A more complicated yet fairly easy to compute formula gives 2 less than the match length for symbols 265 through 284. 2 less than the match length of symbol 285 is 256. That doesn't fit in a byte, but we can store 0, which turns out to be equivalent.
zlib6502 uses this for considerable advantage. It calculates the match length in inflateCodes_lengthMinus2. And once the back pointer into the window has been determined it copies the data like so:
        jsr copyByte
        jsr copyByte
inflateCodes_copyByte:
        jsr copyByte
        dec inflateCodes_lengthMinus2
        bne inflateCodes_copyByte
It makes two explicit calls to copy a byte and then loops over the length less 2, which works as you would expect for lengths 1 to 255. For length 0 it will actually iterate 256 times, as we desire: the first time through the loop, the length of 0 is decremented to 255, which is non-zero, so the loop continues 255 more times, for a total of 256.
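The wraparound trick is easy to model in C with an 8-bit counter (a standalone illustration, not zlib6502 code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t length_minus_2 = 0;   /* the value stored for match length 258 */
    unsigned copied = 2;          /* the two explicit copyByte calls */

    do {
        copied++;                 /* stands in for one jsr copyByte */
    } while (--length_minus_2);   /* dec + bne: 0 wraps to 255, 256 passes */

    printf("copied %u bytes\n", copied);   /* 258 */
    return 0;
}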
I'd have to think that Phil Katz understood intuitively if not explicitly the benefits of keeping the length of matches within 8 bits.

The name of 16 and 32 bits

8 bits is called a "byte". What is 16 bits called? "Short"? "Word"?
And what about 32 bits? I know "int" is CPU-dependent; I'm interested in universally applicable names.
A byte is the smallest unit of data that a computer can work with. The C language defines char to be one "byte", which has CHAR_BIT bits. On most systems this is 8 bits.
A word, on the other hand, is usually the size of the values typically handled by the CPU. Most of the time, this is the size of the general-purpose registers. The problem with this definition is that it doesn't age well.
For example, the MS Windows WORD datatype was defined back in the early days, when 16-bit CPUs were the norm. When 32-bit CPUs came around, the definition stayed, and a 32-bit integer became a DWORD. And now we have 64-bit QWORDs.
Far from "universal", but here are several different takes on the matter:
Windows:
BYTE - 8 bits, unsigned
WORD - 16 bits, unsigned
DWORD - 32 bits, unsigned
QWORD - 64 bits, unsigned
GDB:
Byte (one byte).
Halfword (two bytes).
Word (four bytes).
Giant word (eight bytes).
<stdint.h>:
uint8_t - 8 bits, unsigned
uint16_t - 16 bits, unsigned
uint32_t - 32 bits, unsigned
uint64_t - 64 bits, unsigned
uintptr_t - pointer-sized integer, unsigned
(Signed types exist as well.)
If you're trying to write portable code that relies upon the size of a particular data type (e.g. you're implementing a network protocol), always use <stdint.h>.
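For instance, a sketch of decoding fixed-width protocol fields with <stdint.h> types (the wire format here is made up):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical header: a 16-bit type then a 32-bit length, big-endian
   ("network order"). Exact-width types keep the field sizes identical
   on every platform, unlike short/int/long. */
struct header {
    uint16_t msg_type;
    uint32_t payload_len;
};

static uint16_t read_u16_be(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

static uint32_t read_u32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

int main(void)
{
    const uint8_t buf[6] = {0x00, 0x2A, 0x00, 0x00, 0x01, 0x00};
    struct header h;

    h.msg_type    = read_u16_be(buf);
    h.payload_len = read_u32_be(buf + 2);
    printf("type=%u len=%u\n", h.msg_type, (unsigned)h.payload_len);
    return 0;   /* prints type=42 len=256 regardless of host endianness */
}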
The correct name for a group of exactly 8 bits is really an octet. A byte may have more or fewer than 8 bits (although this is relatively rare).
Beyond this, there are no rigorously well-defined terms for 16 bits, 32 bits, etc., as far as I know.
Dr. Werner Buchholz coined the word byte to mean "a unit of digital information to describe an ordered group of bits, as the smallest amount of data that a computer could process." The word's actual meaning therefore depends on the architecture of the machine in question. The number of bits in a byte is arbitrary, and could be 8, 16, or even 32.
For a thorough dissertation on the subject, refer to Wikipedia.
There's no universal name for 16-bit or 32-bit units of measurement.
The term 'word' is used to describe the number of bits processed at a time by a program or operating system. So, in a 16-bit CPU, the word length is 16 bits. In a 32-bit CPU, the word length is 32 bits. I also believe the term is a little flexible, so if I write a program that does all its processing in chunks of, say, 10 bits, I could refer to those 10-bit chunks as 'words'.
And just to be clear; 'int' is not a unit of measurement for computer memory. It really is just the data type used to store integer numbers (i.e. numbers with a decimal component of zero). So if you find a way to implement integers using only 2 bits (or whatever) in your programming language, that would still be an int.
short, word and int are all dependent on the compiler and/or architecture.
int is a datatype and is usually 32-bit on desktop 32-bit or 64-bit systems. I don't think it's ever larger than the register size of the underlying hardware, so it should always be a fast (and usually large enough) datatype for common uses.
short may be smaller than int; that's all you know. In practice, it's usually 16-bit, but you cannot depend on it.
word is not a datatype; rather, it denotes the natural register size of the underlying hardware.
And regarding the names of 16 or 32 bits, there aren't any. There is no reason to label them.
I have heard them referred to as byte, word and long word. But as others mention, it is dependent on the native architecture you are working on.
They are called 2 bytes and 4 bytes.
There aren't any universal terms for 16 and 32 bits. The size of a word is machine dependent.

Why do we Sign Extend in load word instruction?

I am learning 32-bit MIPS. I wanted to ask: why do we sign-extend the 16-bit offset (in the single-cycle datapath) before sending it to the ALU in the case of Store Word?
I am not sure if it's helpful for you now, but I am posting it anyway.
Let us consider, in a very general sense, an array of instructions in C++, i.e. A[0], A[1], A[2], ...
The "figurative" distance between any two adjacent instructions is 1 UNIT.
Let's take this analogy to MIPS. In MIPS, figuratively, every instruction is separated by "1 UNIT"; however, 1 UNIT = 4 bytes in MIPS. Every instruction is 4 bytes long, and this is why, when moving from instruction to instruction, the PC is incremented by 4, i.e. PC+4. That way, the gap between instruction i and instruction i+2 is "figuratively" 2, but actually 2*4=8, i.e. PC+4+4.
Coming back to the offsets specified in branch instructions: the offset represents the "figurative" distance from the next instruction (the instruction following the branch). So to get the "real" distance, the offset is to be multiplied by 4. This is the reason we shift the offset 2 bits to the LEFT: left shifting any binary value by n bits results in multiplying that value by 2^n, and in our case 2^2 = 4.
So the actual target address of a branch instruction is PC+4+4*Offset.
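As a worked example in C (the addresses are made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t pc = 0x00400018;   /* address of the branch instruction */
    int16_t offset = -3;        /* the encoded, signed word offset */

    /* Sign-extend to 32 bits, then multiply by 4 (= shift left by 2). */
    uint32_t target = pc + 4 + (uint32_t)((int32_t)offset * 4);

    printf("target = 0x%08X\n", target);   /* 0x00400010 */
    return 0;
}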
Hope this helps.
Sounds like the 16-bit offset is a signed 2's complement number, i.e. it can be either positive or negative.
When converting it to 32 bits, the most significant bit needs to be copied to the upper 16 bits in order to keep the sign information.
To the best of my knowledge, in load or store instructions the offset value is added to the value in the temporary register. As the temporary register is 32-bit and an addition of a 16-bit value to a 32-bit value is not possible, the offset is sign-extended.
I think you are getting your concepts a little wrong here.
The 5 bits that you think are going into the ALU actually go into the register file to select one of the 32 (2^5) registers.
Each register itself is 32 bits. Hence, to add the offset to the register value, you need to sign-extend it to 32 bits.
ALU operation is always between two registers of the same size in the single cycle datapath for MIPS.
In the hardware of a 32-bit machine, most ALUs take 32-bit inputs, and all registers are 32-bit registers.
To work with your data, it must be 32 bits wide; this is why we SIGN-extend. Another approach would be to ZERO-extend, but SIGN-extension is used when you are dealing with immediates and offsets, to preserve the sign in 2's complement.
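The difference between the two extensions is easy to demonstrate in C (a sketch using a hypothetical negative offset value):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int16_t offset = (int16_t)0xFF90;           /* -112 in 2's complement */

    int32_t  sign_extended = offset;            /* MSB copied upward */
    uint32_t zero_extended = (uint16_t)offset;  /* upper bits cleared */

    printf("sign: %08X\n", (uint32_t)sign_extended);  /* FFFFFF90 */
    printf("zero: %08X\n", zero_extended);            /* 0000FF90 */
    return 0;
}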
Sign extension happens, e.g. on M68xxx machines, only when loading the address registers, not the data registers.
For example:
        movea.w addr,a0   ; load into an address register: sign-extended
        move.w  addr,d0   ; load into a data register: no sign extension
addr:
        dc.w    $FFFF
Loading the data register yields $0000FFFF; loading the address register, however, yields $FFFFFFFF.
To understand this, take the two's complement of the signed negative representation $FFFF, extend the number to 32 bits, and redo the two's complement to find the corresponding representation in 32 bits.
Cheers and kind regards,
Stephan S.