Math behind 4GB limit on 32 bit systems - binary

I have a very fundamental question relating to 32 bit memory addresses. My understanding is that 2^32 is the maximum number of possible memory addresses on a 32 bit system. Where I am confused is how we go from this number to the alleged 4GB limit. In my research I have seen some people do this:
2^32 = 4,294,967,296 bytes
4,294,967,296 / (1,024 * 1,024) = 4,096 MB ≈ 4 GB
First, where does this (1,024 * 1,024) come from?
Second, correct me if I am wrong, but 4,294,967,296 is labeled as bytes because a byte is the smallest unit of storage space that can be addressed in RAM. Since we're limited to 2^32 addresses, that's the number of bytes that can be addressed.
Third, even though the smallest addressable space in RAM is a byte, this must not be the case with the hard drive, because 32 bit systems usually have hard disks well in excess of 4 GB. Can someone briefly describe the addressing scheme for hard disks?

This is a case of basic arithmetic: the number of bytes per addressable unit times the number of addressable units equals the number of addressable bytes.
The hard part is where to get those numbers from. Here is my take on it:
1 - What is a Kilobyte, Megabyte, Gigabyte?
For RAM, there is consensus that a gigabyte is 1024 megabytes, each consisting of 1024 kilobytes, each being 1024 bytes. This stems from the fact that 1024 is 2^10, but close enough to 1000 to historically allow the kilo prefix.
For storage, vendors years ago started to use strictly decimal units, a megabyte being 1,000,000 bytes (as it makes the capacities look bigger in glossy brochures).
This has led to 1024*1024 bytes being called a MiB and 1000*1000 bytes being called a MB.
2 - The addressable unit
For RAM, the addressable unit is the byte, even if it is fetched from physical RAM in chunks of at least 4 bytes.
For mass storage, the addressable unit is the sector or block, which is most often 512 bytes, though 4096 bytes is catching up fast.
3 - The number of addressable units is much more complicated, let's start with RAM:
A 32 bit CPU (sans the MMU!) can address 2^32 bytes, or 4 GiB.
All modern 32 bit CPUs include an MMU that maps these 4 GiB of virtual address space into a physical address space.
This physical address space can have a different size than 4 GiB, as a function of the MMU using more (or, in prehistoric times, fewer) than 32 physical address lines. Today's most common implementations use 36 or more physical bits, resulting in 16*4 GiB or more (PAE, or Physical Address Extension).
This MMU magic does not work around the CPU running in 32 bit mode, i.e. for every process, the address space can't be larger than 4 GiB.
To make things a little more interesting, a part of this address space is used for kernel functionality in every modern OS I know of. This results in 2 GiB or 3 GiB of maximum usable address space per process for all mainstream OSes.
And as this still is much too simple: running the MMU in a mode where it can actually use more than 4 GiB of physical RAM must be supported by the OS. A remarkable example is Windows XP 32 bit, which does NOT allow that.
And last but not least: a part of the physical address space is most often used for memory-mapped hardware. Combined with the OS limits above, this results in Windows XP 32 bit sometimes being able to use only 2.5 to 3.5 GiB of physical RAM.
It's much less of a hassle for storage:
In all modern PC-based cases I know of, the addressable units are simply counted with 32 or 48 bits (LBA, or logical block addressing). Even in its most basic version, this is enough for 2 TiB of storage per disk (2^32 blocks of 512 bytes each). Maxed-out versions with 48 bit LBA and 4 KiB per block allow for 2^48 * 4 KiB = 1 EiB (about a million TiB) per disk.
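A quick arithmetic sketch of the limits above (the unit sizes and block counts are the ones named in the answer):

```python
# Byte capacity = number of addressable units * bytes per unit.

# 32-bit byte-addressable RAM: 2^32 bytes = 4 GiB
print(2**32 // 1024**3)         # 4 (GiB)

# Basic 32-bit LBA, 512-byte blocks: 2^32 * 512 bytes = 2 TiB
print(2**32 * 512 // 1024**4)   # 2 (TiB)

# 48-bit LBA, 4 KiB blocks: 2^48 * 4096 bytes = 1 EiB (about a million TiB)
print(2**48 * 4096 // 1024**6)  # 1 (EiB)
```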

A computer is not all memory. The 32 bits are the width of the addresses the instruction set works with; 64 bits gives you more bits to reference more memory. Note that 2^32 = 4,294,967,296 is the number of distinct address combinations, and since each address refers to one byte, it is also the number of addressable bytes.
As for the math: dividing by 1,024 * 1,024 simply converts bytes to MiB; it does not mean bits are reserved for other uses.

Questions regarding Fermi-Architecture, Warps and Performance

As the Fermi whitepaper suggests, there are 16 SMs (Streaming Multiprocessors), each of which consists of 32 cores. The GPU executes threads in groups of 32, called warps.
First question: Am I right to assume, that each warp could be treated as something like the vector-width, meaning: I could execute a single instruction on 32 "datas" in parallel?
And if so, does it mean that in total the Fermi architecture allows executing operations on 16 * 32 = 512 data in parallel, whereas the 16 operations can differ respectively?
If so, how many times can it execute 512 datas in parallel in one second?
First question: Am I right to assume, that each warp could be treated as something like the vector-width, meaning: I could execute a single instruction on 32 "datas" in parallel?
Yes.
And if so, does it mean that in total the fermi-architecture allows executing operations on 16 * 32 = 512 data in parallel, whereas 16 operations can differ respectively?
Yes, possibly, depending on the operation type. A GPU SM includes functional units that handle different types of operations (instructions). An integer add may not be handled by the same functional unit as a floating-point add, for example. Because different operations are handled by different functional units, and due to the fact that there is no particular requirement that the GPU SM contain 32 functional units for each instruction (type), the specific throughput will depend on the instruction. However the 32 functional units you are referring to can handle a float add, multiply, or multiply-add. So for those specific operation types, your calculation is correct.
If so, how many times can it execute 512 datas in parallel in one second?
This is given by the clock rate, divided by the number of clocks to service an instruction. For example, with 32 FP add units, the GPU can theoretically retire one of these, for 512 "datas" in a single clock cycle. If there were another operation, such as integer add, which only had 16 functional units to service it, then it would require 2 clocks to service it warp-wide. So we would divide the number by 2. And if you had a mix of operations, say 8 floating-point adds issued on 8 SMs, and 8 integer adds issued on the other 8 SMs, then you would have a more complex calculation, perhaps.
The theoretical maximum floating point throughput is computed this way. For example, the Fermi M2090 has all 16 SMs enabled, and is claimed to have a peak theoretical throughput of 1332 GF/s for FP32 ops. That calculation is as follows:
16 SMs * 32 functional units/SM * 2 ops/functional unit/hotclk * 2 hotclock/clk * 651M clks/sec = 1333GF/s FP32
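That line of arithmetic can be checked directly; a Python sketch using the same numbers as the answer:

```python
sms = 16               # streaming multiprocessors
units_per_sm = 32      # FP32 functional units per SM
ops_per_unit = 2       # a fused multiply-add counts as 2 ops
hotclocks_per_clk = 2  # Fermi's shader "hot clock" runs at 2x the core clock
core_clock = 651e6     # core clocks per second

gflops = sms * units_per_sm * ops_per_unit * hotclocks_per_clk * core_clock / 1e9
print(gflops)  # 1333.248
```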

Why are CUDA memory allocations aligned to 256 bytes?

According to the question "cuda alignment 256bytes seriously?", CUDA memory allocations are guaranteed to be aligned to at least 256 bytes.
Why is that the case? 256 bytes is much larger than any numeric data type. It might be the size of a vector, but GPUs do not require load/store to be aligned to the size of the whole vector, indeed they go so far as to support gather/scatter where every individual element may be placed at any memory address that is a multiple of the size of the element.
What purpose does the 256-byte alignment serve?
Why is that the case? 256 bytes is much larger than any numeric data type.
Well, I'm sure there are multiple reasons (e.g. it's easier to manage fewer, larger allocations), but about your specific point: don't think about a single value of a numeric data type - think about a full warp's worth: if sizeof(float) is 4, then a warp's worth of floats is 32 * 4 = 128 bytes. And if it's a double or long int (64-bit int), then you get 32 * 8 = 256.
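The warp's-worth arithmetic is easy to check; a small Python sketch (the warp size and element sizes follow the answer above):

```python
WARP_SIZE = 32  # threads per warp

# bytes read when every thread in a warp loads one element
for name, elem_bytes in [("float", 4), ("double / long int", 8)]:
    print(name, "->", WARP_SIZE * elem_bytes, "bytes per warp")
```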
Note: It is not necessary for warps to make such coalesced reads of multiple values from memory. A single thread can read a single unaligned byte and that will work. But - performance will suffer if the read pattern is not coalesced to reading contiguous, aligned, chunks (typically of 128 bytes or 32 bytes); see also:
In CUDA, what is memory coalescing, and how is it achieved?

size of memory of computer that uses 16 bits memory address

If the memory address of a computer uses 16 bits, what is the size of its memory? I find many references online but I can't be certain which are relevant. Thank you.
2^16?
From wikipedia:
For instance, a computer said to be "32-bit" also usually allows 32-bit memory addresses; a byte-addressable 32-bit computer can address 2^32 = 4,294,967,296 bytes of memory, or 4 gibibytes (GiB). This seems logical and useful, as it allows one memory address to be efficiently stored in one word.
So to answer your question: in general, yes, a byte-addressable 16-bit computer can address 2^16 bytes of memory.
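The arithmetic, assuming one byte per address as above:

```python
addr_bits = 16
mem_bytes = 2**addr_bits  # one byte per address on a byte-addressable machine
print(mem_bytes)          # 65536
print(mem_bytes // 1024)  # 64  (i.e. 64 KiB)
```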

Bytes in a 32 bit system?

I am in the process of interviewing at a few places and I saw this question in one of the discussion forums.
How many bytes are contained in a 32 bit system?
The answer given is 2^29 or 536870912 - I believe it's because a 32 bit system can address 2^32 bits of memory and 8 bits to a byte gives 2^32/8 = 2^29 bytes.
Can someone confirm if I'm on the right track?
Thanks!
The addressable unit is a byte, not a bit.
So a 32-bit pointer allows addressing 2^32 bytes.
If the question really was "How many bytes are in a 2^32 bit system?", the answer is correct.
(But it's still badly phrased.)
It's not that 2**32 bits are accessible, it's that 2**32 words are accessible. If we say 4 bytes per word, then 2**34 bytes is a closer value.
Although traditional systems are byte-oriented and therefore could access 2**32 bytes.
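The disagreement between the answers comes down to the assumed addressable unit; a Python sketch comparing the three assumptions, with 32 address bits in every case:

```python
addresses = 2**32  # distinct 32-bit addresses

bytes_if_byte_addressable = addresses      # the usual case: 2^32 bytes
bytes_if_bit_addressable = addresses // 8  # the interview answer: 2^29 bytes
bytes_if_word_addressable = addresses * 4  # 4-byte words: 2^34 bytes

print(bytes_if_bit_addressable == 2**29)   # True
print(bytes_if_word_addressable == 2**34)  # True
```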

Virtual addresses size computing [closed]

I am stuck on this problem which I am studying for an exam tomorrow. (I understand the concept of virtual vs. physical addresses, page frames, address bus, etc.)
If you're using 4K pages with 128K of RAM and a 32 bit address bus, how large could a virtual address be? How many regular page frames could you have?
EDIT: I believe the answer is 2^32 and 2^20. I just do not know how to compute this.
Your answers are exactly right.
With a 32-bit address bus, you can access a virtual space of 2^32 unique addresses.
Each 4K page uses 2^12 (physical) addresses, so you can fit (2^32) / (2^12) = 2^20 pages into the space.
Good luck with your exam!
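Both results can be computed directly, using the page size and address width from the question:

```python
addr_bits = 32
page_size = 4 * 1024              # 4K pages, i.e. 2^12 bytes

virtual_addresses = 2**addr_bits  # 2^32 distinct virtual addresses
pages = virtual_addresses // page_size
print(pages == 2**20)             # True
```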
Edit to address questions in the comments:
How do you know you cannot access more than 2^32 addresses?
A 32-bit address bus means there are 32 wires connected to the address pins on the RAM--each wire is represented by one of the bits. Each wire is held at either a high or low voltage, depending on whether the corresponding bit is 1 or 0, and each particular combination of ones and zeroes, represented by a 32-bit value such as 0xFFFF0000, selects a corresponding memory location. With 32 wires, there are 2^32 unique combinations of voltages on the address pins, which means you can access 2^32 locations.
So what about the 4K page size?
If the system has a page size of 4K, it means the RAM chips in each page have 12 address bits (because 2^12 = 4K). If your hypothetical system has 128K of RAM, you'd need 128K/4K = 32 pages, or sets of RAM chips. So you can use 12 bits to select the physical address on each chip by routing the same 12 wires to the 12 address pins on every RAM chip. Then use 5 more wires to select which one of the 32 pages contains the address you want. We've used 12 + 5 = 17 address bits to access 2^17 = 128K of RAM.
Let's take the final step and imagine that the 128K of RAM resides on a memory card. But with a 32-bit address bus, you still have 32-17 = 15 address bits left! You can use those bits to select one of 2^15 = 32768 memory cards, giving you a total virtual address space of 2^32 = 4G of RAM.
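The bit-splitting described above, as a Python sketch (12 offset bits + 5 page-select bits + 15 card-select bits = 32):

```python
offset_bits = 12      # 2^12 = 4K bytes per page
page_select_bits = 5  # 2^5 = 32 pages -> 128K of RAM per card
card_select_bits = 32 - offset_bits - page_select_bits  # 15 bits left over

print(2**offset_bits)                       # 4096 bytes per page
print(2**(offset_bits + page_select_bits))  # 131072, i.e. 128K per card
print(2**card_select_bits)                  # 32768 possible cards
```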
Is this useful beyond RAM and memory cards?
It's common practice to divide a large set of bits, like those in the address, into smaller sub-groups to make them more manageable. Sometimes they're divided for physical reasons, such as address pins and memory cards; other times it's for logical reasons, such as IP addresses and subnets. The beauty is that the implementation details are irrelevant to the end user. If you access memory at address 0x48C7D3AB, you don't care which RAM chip it's in, or how the memory is arranged, as long as a memory cell is present. And when you browse to 67.199.15.132, you don't care if the computer is on a class C subnet as long as it accepts your upvotes. :-)