What is the cost of memory access? - language-agnostic

We like to think that a memory access is fast and constant, but on modern architectures/OSes, that's not necessarily true.
Consider the following C code:
int i = 34;
int *p = &i;
// do something that may or may not involve i and p
{...}
// 3 days later:
*p = 643;
What is the estimated cost of this last assignment in CPU instructions, if
i is in L1 cache,
i is in L2 cache,
i is in L3 cache,
i is in RAM proper,
i is paged out to an SSD disk,
i is paged out to a traditional disk?
Where else can i be?
Of course the numbers are not absolute, but I'm only interested in orders of magnitude. I tried searching the webs, but Google did not bless me this time.

Here are some hard numbers, demonstrating that exact timings vary from CPU family to family and from version to version: http://www.agner.org/optimize/
These numbers are a good guide:
L1 1 ns
L2 5 ns
RAM 83 ns
Disk 13700000 ns
An infographic giving the orders of magnitude can be found at http://news.ycombinator.com/item?id=702713
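If you want to reproduce numbers like these on your own machine, the usual trick is a pointer-chasing microbenchmark. Here is a minimal host-side sketch, assuming a POSIX system; the buffer size, iteration count, and use of rand() are illustrative choices, not something taken from the answers above. Because each load depends on the previous one, the average time per hop approximates the latency of whichever level of the memory hierarchy the working set fits in (shrink n to fit in L1/L2/L3, grow it to spill into RAM).

// Pointer-chasing latency sketch: walk a random cyclic permutation so the
// hardware prefetcher can't help and every load waits for the previous one.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t n = (size_t)1 << 24;            /* 16M entries * 8 B = 128 MiB: spills past L3 */
    size_t *next = (size_t *)malloc(n * sizeof(size_t));

    /* Build a single random cycle (Sattolo's algorithm; rand() is crude but fine for a sketch). */
    for (size_t i = 0; i < n; i++) next[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t hops = 0; hops < n; hops++) p = next[p];   /* fully serialized loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average latency: %.1f ns per access (checksum %zu)\n", ns / (double)n, p);
    free(next);
    return 0;
}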

Norvig has some values from 2001. Things have changed some since then but I think the relative speeds are still roughly correct.

It could also be in a CPU register. The C/C++ keyword "register" hints to the compiler that the variable should be kept in a register, but there is no guarantee it will stay there or even ever get there.

As long as the cache/RAM/hard disk/SSD is not busy serving other accesses (e.g. DMA requests) and the hardware is reasonably reliable, the cost is still constant (though it may be a large constant).
When you get a cache miss and the variable has to be paged in from the hard disk, it is still just a simple disk read request, but the cost is huge: the access faults into the kernel, the kernel sends a read request to the disk, waits for the disk to write the data into RAM, and only then can the data be read from RAM into the cache and into a register. However, this is still a constant cost.
The actual numbers and proportions will vary depending on your hardware, and on how well-matched its parts are (e.g. if your CPU runs at 2000 MHz and your RAM delivers data at 333 MHz, they don't sync very well). The only way you can figure this out is to test it in your program.
And this is not premature optimization, this is micro-optimization. Let the compiler worry about these kinds of details.

These numbers change all the time. But for rough estimates for 2010, Kathryn McKinley has nice slides on the web, which I don't feel compelled to copy here.
The search term you want is "memory hierarchy" or "memory hierarchy cost".

Where else can i be?
The pointer p and the pointed-to object i are different things; both of them can be located in any of the locations in your list. The pointer's value might additionally still be stored in a CPU register when the assignment is made, so it doesn't need to be fetched from RAM/cache/…
Regarding performance: this is highly CPU-dependent. Thinking in orders of magnitude, accessing RAM is worse than accessing cache entries, and accessing swapped-out pages is the worst. All are somewhat unpredictable because they also depend on other factors (e.g. other processors, depending on the system architecture).
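To make that distinction concrete, here is a tiny sketch (the heap allocation is purely illustrative): the pointer itself is a local variable that the compiler will very likely keep in a register, while the pointed-to int lives wherever its page currently resides - cache, RAM, or swap.

#include <stdlib.h>

int main(void) {
    int *p = (int *)malloc(sizeof(int));   /* the pointee (*p) lives on the heap           */
    *p = 34;                               /* p itself is almost certainly in a register   */
    /* ... 3 days later: */
    *p = 643;                              /* cost depends on where the heap page is now:  */
                                           /* L1/L2/L3, RAM, or a swapped-out page on disk */
    free(p);
    return 0;
}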


Separate instruction and data memory

I am currently in a Computer Architecture class and this is the one thing majorly stumping me. I asked my professor why we have separate instruction and data memory (consider the single-cycle MIPS data path I'm attaching).
My thoughts:
add extra ports (not an issue of FU reuse, similar to register file implementation but with a port for instructions)
consolidate so that memory could be unified and not go unused
His:
agreed with me on last point
adding ports has a roughly quadratic cost, with a negative impact on performance
separate allows more leeway in placement on chip
single-access memory is faster
Could anyone please elaborate on any of these points in more depth, or add anything of their own? I'm still not fully clear on this.
Yes, multi-ported DRAM is an option, but much more expensive, probably more than twice as expensive per byte. (And lower capacity per die area, so available sizes will be smaller).
In practice real CPUs just have split L1d/L1i caches, and unified L2 cache and memory, assuming it's ultimately a von Neumann type of architecture.
We call this "modified Harvard" - the performance advantages of Harvard allowing parallel code-fetch and load/store, except for contention for access to the unified cache or memory. But it's rare to have lots of code cache misses at the same time as data misses, because if you're stalling on code fetch then you'll have bubbles in the pipeline anyway. (Out-of-order exec could hide that better than a single single-cycle design, of course!)
It needs extra sync / pipeline flushing when we want to run machine code that we recently generated / stored, e.g. a JIT compiler, but other than that it has all the advantages of unified memory and the CPU-pipeline advantages of the Harvard split. (You need extra synchronization anyway to run recently-stored code on an ISA that allows deeply pipelined and out-of-order exec implementations, and which fetch code far ahead into buffers in the pipeline to give more room to absorb bubbles).
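As a concrete illustration of the "run recently generated/stored code" case, here is a minimal sketch assuming a POSIX system, an x86-64 host, and the GCC/Clang builtin __builtin___clear_cache. On x86 the builtin is essentially a no-op because the hardware keeps instruction fetch coherent with stores, but on ARM and other ISAs it emits the required cache-maintenance and synchronization instructions. Some hardened systems refuse writable+executable mappings, so treat this purely as a sketch.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* x86-64 machine code for: mov eax, 42; ret */
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    memcpy(buf, code, sizeof code);                                   /* store the new code      */
    __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);  /* sync I-side with D-side */

    int (*fn)(void) = (int (*)(void))buf;                             /* now jump into it        */
    printf("JIT-ed function returned %d\n", fn());

    munmap(buf, 4096);
    return 0;
}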
What does a 'split' cache mean, and how is it useful (if it is)?
L1 caches usually have a split design, but L2 and L3 caches have a unified design - why?
The first pipelined CPUs had small caches or, in the case of the MIPS R2000, even off-chip caches with only the controllers on-chip. But yes, the MIPS R2000 had split I and D caches, because you don't want code fetch to conflict with the MEM stage of load or store instructions; that would introduce a structural hazard that would interfere with running one instruction per cycle when you don't have cache misses.
In a single-cycle design I guess your cycle would normally be long enough to access memory twice because you aren't overlapping code-fetch and load/store, so you might not even need multi-ported memory?
L1 data caches are already multi-ported on modern high-performance CPUs, allowing them to commit a store from the store buffer in the same cycle as doing 1 or 2 loads on load execution units.
Having even more ports to also allow code-fetch from it would be even more expensive in terms of power, vs. two slightly smaller caches.
If you think of the Instruction Memory and Data Memory as caches, as in being backed by a unified main memory, then you have the traditional Modified Harvard Architecture, which has some of the advantages of both the Von Neumann and the Harvard Architecture together.
One point you didn't seem to raise is that separation of the two memories (caches) allows for simultaneous access, so an instruction can be read while a data memory is read or written in the same cycle.  This would be more difficult with a unified cache/memory.  This advantage applies to single cycle and pipelined processors since in both designs there is overlap between instruction fetch (IF stage in pipelined) and memory operations (MEM stage in pipelined).
Further, as the Instruction Memory is read-only it has less circuitry.  In the case of being caches, the IM has no dirty bits, no write back, etc..  Further, the IM and DM can have different associativity.
In the case of not being caches, it is not clear how the computer system loads the instruction memory, perhaps it is some fast ROM or is loaded by an external device from ROM into IM.  A number of embedded systems have Instruction Tightly Integrated Memory (and/or Data memory ITIM/DTIM) that then do not act as caches and are not necessarily backed by main memory, instead serving as the primary memories.

Memory coalescing and transaction

After reading about the topic, I have 2 questions related to Global Memory coalescing access:
1- I read that one requirement for memory coalescing is that the words accessed by the threads must be 4, 8, or 16 bytes, but apparently this is valid only for devices with compute capability less than 1.3. Is that right? For later devices (>= 1.3), a thread can even access 1 or 2 bytes and still get a successfully coalesced memory access.
2- Will it matter (mainly in time) if a (half-)warp's global memory access generates a 128-byte instead of a 64-byte memory transaction because of word misalignment? And what about the extra data transferred - will it be discarded by the system?
Thank you
1) You can access the data any way you want on later devices, but the performance will still be poor if you request a data segment that is narrow, i.e. you will not achieve the full memory bandwidth of your GPU.
2) This again depends on the overall scheme of your code. Generally, the improvement in later versions of CUDA was that non-aligned reads/writes no longer resulted in abysmal performance, but resulted in e.g. 2 write commands being issued instead of one.
Think of it like putting people on a bus. If you can fill your whole crowd into a single bus with one destination, you get better efficiency than using two buses that are only half filled.
So yes, it will matter, but depending on whether you are memory or compute bound, it will matter differently.
Arranging your read/write patterns to utilize the full bandwidth has given me the last 20-30% performance in many applications.
/Henrik
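To make the bus analogy concrete, here is a minimal CUDA sketch (kernel names and the stride parameter are made up for illustration). In the first kernel, consecutive threads of a warp touch consecutive 4-byte words, so the warp's accesses collapse into one or two wide transactions; in the second, each thread jumps by a stride, so the same 32 threads scatter across many 128-byte segments and the hardware must issue far more transactions, most of whose bytes go unused.

__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // thread k touches word k: fully coalesced
}

__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];            // thread k touches word k*stride: poorly coalesced
}

With stride = 1 the second kernel behaves like the first; with stride = 32 every 4-byte access lands in its own 128-byte segment, which is the half-empty-bus case described above.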

Conceptual understanding of a Memory Bandwidth of a GPU

I am a little confused by the concept of the memory bandwidth of a GPU.
According to the TESLA M 2090 GPU specs it says
the peak bandwidth is 177.6 GB/s.
So when people refer to bandwidth, does it refer to
the speed of one-way traffic, as in the number of bytes per second that can be read from the device memory, or
the speed of two-way traffic, as in the number of bytes per second that can be read from and written to the device memory?
Wherever I read this term, I don't see this clarification being made.
There is only one set of wires on the bus so data can't be written or read at the same time. In theory the bandwidth is the same, total read+write == total read == total write.
But in practice the transfers are much more efficient if you are writing large contiguous blocks of data to the device, this is the most common usage and is what the system is optimised for.
edit. The internal memory bandwidth of a graphics card (ie the memory path between various components on the card) is much higher than the bandwidth to/from the computer.
It's also much more complex, there are different types of memory connected to different processors in different ways and the manufacturer will pick the numbers that make it sound the highest - this number is really meaningless except to compare different models of very similar cards from the same GPU family.
The bandwidth is the amount of data that can be read or written in a given period of time.
The same bus is used for both reads and writes. In a given clock cycle, the bus can be used for either a read or a write.
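In practice, figures like the 177.6 GB/s above are compared against an "effective bandwidth" computed exactly as described: bytes read plus bytes written, divided by elapsed time. Here is a hedged CUDA sketch (array size and launch configuration are arbitrary illustrations); a plain device-to-device copy moves 2*N*sizeof(float) bytes over the same memory bus, which is why reads and writes share the single quoted peak number.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void copy_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;                               // 16M floats = 64 MiB each way
    const int threads = 512, blocks = (n + threads - 1) / threads;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copy_kernel<<<blocks, threads>>>(d_in, d_out, n);    // warm-up launch, not timed

    cudaEventRecord(start);
    copy_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;   // (bytes read + bytes written) per second
    printf("effective bandwidth: %.1f GB/s\n", gbps);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}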

What happens when all threads of a warp read the same global memory address?

I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, and the programming environment is CUDA 4.0.
Besides, can anybody explain the concept of bus utilization? What is the difference between caching loading and non-caching loading? I saw the concept in http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
All threads in warp accessing same address in global memory
I could answer your questions off the top of my head for AMD GPUs. For Nvidia, googling found the answers quickly enough.
I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, and the programming environment is CUDA 4.0.
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_Best_Practises_For_OpenCL_Programming.pdf, dated 2009, says:
Coalescing:
Global memory latency: 400-600 cycles. The single most important performance consideration!
Global memory access by threads of a half warp can be coalesced to one transaction for words of size 8-bit, 16-bit, 32-bit, or 64-bit, or two transactions for 128-bit.
Global memory can be viewed as composed of aligned segments of 16 and 32 words.
Coalescing in Compute Capability 1.0 and 1.1: the k-th thread in a half warp must access the k-th word in a segment; however, not all threads need to participate.
Coalescing in Compute Capability 1.2 and 1.3: coalescing for any pattern of access that fits into a segment size.
So, it sounds like having all threads of a warp access the same 32-bit address of global memory will work as well as could be hoped for, in anything >= Compute Capability 1.2. But not for 1.0 and 1.1.
Your card is okay.
I must admit that I have not tested this for Nvidia. I have tested it for AMD.
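For completeness, a minimal CUDA sketch of the "every thread reads the same word" case (the kernel is illustrative, not taken from the linked material). On Fermi the 32 identical addresses all fall within one cache line, so the warp is serviced by a single 128-byte transaction and the value is broadcast to all lanes; the serialization you may be thinking of is a shared-memory bank-conflict issue, and even there reading the same word is a broadcast rather than a conflict.

__global__ void broadcast_read(const int *src, int *dst) {
    int v = src[0];                                   // every thread of the warp loads the same address
    dst[blockIdx.x * blockDim.x + threadIdx.x] = v;   // one 128-byte transaction per warp, value broadcast
}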
Difference between cache and uncached load
To start off, look at slide 4 of the presentation you refer to, http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
I.e. the slide titled "Differences between CPU & GPUs" - that says that CPUs have huge caches, and GPUs don't.
A few years ago such a slide might have said that GPUs don't have any caches at all. However, GPUs have begun to add more and more cache, and/or to switch more and more local memory over to cache.
I am not sure if you understand what a "cache" is in computer architecture. It's a big topic, so I will only provide a short answer.
Basically, a cache is like local memory. Both cache and local memory are closer to the processor or GPU than DRAM main memory - whether that be the GPU's private DRAM or the CPU's system memory. (DRAM main memory is what Nvidia calls Global Memory.) Slide 9 illustrates this.
Both cache and local memory are closer to the GPU than DRAM global memory: on slide 9 they are drawn as being inside the same chip as the GPU, whereas the DRAMs are separate chips. This can have several good effects, on latency, throughput, power - and, yes, bus utilization (related to bandwidth).
Latency: global memory is 400-800 cycles away. This means that if you only had one warp in your application, it would only execute one memory operation every 400-800 cycles. This means that, in order not to slow down, you need many threads/warps producing memory requests that can be run in parallel, i.e. that have high MLP (Memory Level Parallelism). Fortunately graphics usually does this. The caches are closer, so will have lower latency. Your slides do not say what it is, but other places say 50-200 cycles, 4-8X faster than global memory. This translates to needing fewer threads&warps to avoid slowing down.
Throughput/Bandwidth: there is typically more bandwidth to local memory and/or cache than to DRAM global memory. Your slides say 1+ TB/s versus 177 GB/s - i.e. cache and local memory are more than 5X faster. This higher bandwidth could translate to significantly higher framerates.
Power: you save a lot of power going to cache or local memory rather than to DRAM global memory. This may not matter to a desktop gaming PC, but it matters to a laptop or to a tablet PC. Actually, it matters even to a desktop gaming PC, because less power means it can be (over)clocked faster.
OK, so local and cache memory are similar in the above? What's the difference?
Basically, it is easier to program for the cache than it is for local memory. Very good, expert, ninja programmers are needed to manage local memory properly, copying stuff from global memory as needed and flushing it out. Whereas cache memory is much easier to manage, because you just do a cached load, and the memory is put in cache automatically, where it will be accessed faster the next time around.
But caches have a downside as well.
First, they actually burn a bit more power than local memory - or they would, if there were actually separate local and global memories. However, in Fermi, the local memory may be configured as cache, and vice versa. (For years GPU folks said "we don't need no stinking cache - cache tags and other overhead are just wasteful.")
More importantly, caches tend to operate on cache lines - but not all programs do. This leads to the bus utilization issue you mention. If a warp accesses all words in a cache line, great. But if a warp only accesses 1 word in a cache line, i.e. one 4-byte word, and then skips 124 bytes, then 128 bytes of data are transferred over the bus but only 4 bytes are used. I.e. >96% of the bus bandwidth is being wasted. This is low bus utilization.
Whereas the very next slide shows that a non-caching load, such as one might use to load data into local memory, would transfer only 32 bytes, so "only" 28 bytes out of 32 are wasted. In other words, the non-cache loads could be 4X more efficient, 4X faster, than the cached loads.
Then why not use non-cache loads everywhere? Because they are harder to program - it requires expert ninja programmers. And caches work pretty well much of the time.
So, instead of paying your expert ninja programmers to spend a lot of time optimizing all of the code to use non-cache loads and hand-managed local memory - instead you do the easy stuff using cached loads, and you let the highly paid expert ninja programmers concentrate on the stuff that the cache does not work well for.
Besides: nobody likes admitting it, but oftentimes the cache does better than the expert ninja programmers.
Hope this helps.
Throughput, power, and bus utilization: in addition to reducing latency, serving an access from cache or local memory instead of from DRAM global memory also saves bandwidth and power, as described above.
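In CUDA terms, the "local memory" of this answer (OpenCL terminology) corresponds to __shared__ memory, and on Fermi you could request non-caching (32-byte) global loads for a whole compilation unit with nvcc -Xptxas -dlcm=cg. Here is a hedged sketch of the two programming styles contrasted above (kernel names and the tile size are illustrative, and the hand-staged version only pays off when the tile is actually reused):

#define TILE 256   // assumes blockDim.x == TILE

__global__ void scale_cached(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];                  // ordinary cached loads: L1/L2 do the work
}

__global__ void scale_staged(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                       // explicitly managed on-chip ("local") memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];              // the programmer copies global -> shared ...
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];      // ... and is responsible for reuse and eviction
}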

Where does super-linear speedup come from?

In parallel computing, theoretically super-linear speedup is not possible. But in practice we do see such cases. One reason is the cache effect, but I fail to understand what role it plays. Also, there are other things involved, but what are they? In summary,
How are super-linear speedups possible?
I'm a beginner with respect to parallel computing.
Suppose you have an 8 processor machine, each processor has a 1MB cache, and your computation uses 6MB of data.
On 1 processor the computation will be doing a lot of data movement between CPU, cache and RAM. On 8 processors the computation will only have to move data between CPU and cache. This way you can achieve super-linear speedup.
These figures and this analysis have been simplified for exposition for a beginner.
In short, superlinear speedup is achieved when the total amount of work processors do is strictly less than the total work performed by a single processor.
This can happen in three ways:
The original sequential algorithm was really bad; using the parallel version of the algorithm on one processor will usually do away with the superlinear speedup.
The parallel algorithm uses some search, like a random walk: the more processors that are walking, the less distance has to be walked in total before you reach what you are looking for.
Modern processors have faster and slower memories, and they typically try to keep the data you are using in the fast memory. We can safely assume your amount of data is larger than the amount of fast memory. If you use n processors you have n times the amount of fast memory, so more data fits in the fast memory, which makes it possible to take less time (and thus do less work) on the same task. A back-of-the-envelope sketch of this last effect follows below.
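A back-of-the-envelope sketch of that last (cache) effect, in host-only code. The 6 MB working set, 1 MB per-processor cache, and 8 processors mirror the example in the answer above, and the 5 ns / 83 ns access costs are the rough cache/RAM figures from the very first answer in this section. The model counts memory time only, so it wildly overstates any real speedup; the point is simply that the per-access cost can fall by more than the factor P once each processor's share of the data fits in its cache.

#include <stdio.h>

int main(void) {
    const double working_set_mb = 6.0, cache_mb = 1.0;   /* the 8-processor example above    */
    const double t_cache_ns = 5.0,  t_ram_ns = 83.0;     /* rough cache vs. RAM access costs */
    double cost1 = 0.0;

    for (int p = 1; p <= 8; p *= 2) {
        double share = working_set_mb / p;                           /* data per processor         */
        double hit   = share <= cache_mb ? 1.0 : cache_mb / share;   /* fraction served from cache */
        double cost  = hit * t_cache_ns + (1.0 - hit) * t_ram_ns;    /* average ns per access      */
        if (p == 1) cost1 = cost;
        printf("P=%d: %.1f ns per access, memory-only speedup %.1fx\n",
               p, cost, p * cost1 / cost);
    }
    return 0;
}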