Does the position of a function/method in a program matter for speed at a lower level (memory)?

Let's say I write a program that contains many functions/methods, some of which are called far more often than others.
In this case, does the positioning of a function/method matter in terms of speed at a lower level (memory)?
I am currently learning Computer Organization & Architecture, which is where this question came from.

RAM itself is "flat": equal performance at any address (except for NUMA local vs. remote memory in a multi-socket machine, or mixed-size DIMMs on a single socket leading to only partial dual-channel benefits; see footnote 1).
i-cache and iTLB locality can make a difference, so grouping "hot" functions together can be useful even if you don't just inline them.
Locality also matters for demand paging of code in from disk: If a whole block of your executable is "cold", e.g. only needed for error handling, program startup doesn't have to wait for it to get page-faulted in from disk (or even soft page faults if it was hot in the OS's pagecache). Similarly, grouping "startup" code into a page can allow the OS to drop that "clean" page later when it's no longer needed, freeing up physical memory for more caching.
Compilers like GCC do this, putting CRT startup code like _start (which eventually calls main) into a .init section in the same program segment (mapped by the program loader) as .text and .fini, just to group startup code together. Any C++ non-const static-initializer functions would also go in that section.
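For what it's worth, GCC and Clang also let you hint this grouping yourself with the hot/cold function attributes; cold functions get moved into a separate .text.unlikely section so the hot path stays dense. A minimal sketch (the function names are made up for illustration):

    /* Hinting hot/cold code placement to GCC/Clang.  The hot/cold function
     * attributes are real; the compiler moves cold functions into the
     * .text.unlikely section so the hot path stays dense in the i-cache/iTLB.
     * Function names here are made up for illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    __attribute__((cold)) static void report_fatal_error(const char *msg)
    {
        /* Rarely-executed error path: fine if it lives in a "cold" page. */
        fprintf(stderr, "fatal: %s\n", msg);
        exit(1);
    }

    __attribute__((hot)) static long sum(const int *a, size_t n)
    {
        /* Hot inner loop: benefits from being packed next to other hot code. */
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += a[i];
        return total;
    }

    int main(void)
    {
        int data[4] = { 1, 2, 3, 4 };
        if (sum(data, 4) != 10)
            report_fatal_error("unexpected sum");
        puts("ok");
        return 0;
    }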
Footnote 1: Usually; IIRC it's possible for a computer with one 4GB and one 8GB stick of memory to run dual-channel for the first 8GB of physical address space, but only single-channel for the last 4GB, so half the memory bandwidth. I think some real-life Intel chipsets / CPU memory controllers work like that.
But unless you were making an embedded system, you don't choose where in physical memory the OS loads your program. It's also much more normal for computers to use matched memory on multi-channel memory controllers so the whole range of memory can be interleaved between channels.
BTW, locality matters for DRAM itself: it's laid out in a row/column setup, and switching rows takes an extra DDR controller command vs. just reading another column in the same open "page". DRAM pages aren't the same thing as virtual-memory pages; a DRAM page is memory in the same row on the same channel, and is often 2kiB. See What Every Programmer Should Know About Memory? for more details than you'll probably ever want about DDR DRAM, and some really good stuff about cache and memory layout.

Related

Separate instruction and data memory

I am currently in a Computer Architecture class and this is the one thing majorly stumping me. I asked my professor why we have separate instruction and data memory (consider the single-cycle MIPS data path I'm attaching).
My thoughts:
add extra ports (not an issue of FU reuse, similar to register file implementation but with a port for instructions)
consolidate so that memory could be unified and not go unused
His:
agreed with me on last point
adding ports has a roughly quadratic negative impact on performance
separate allows more leeway in placement on chip
single-access memory is faster
Could anyone please elaborate on any of these points in more depth, or add anything of their own? I'm still not fully clear on this.
Yes, multi-ported DRAM is an option, but much more expensive, probably more than twice as expensive per byte. (And lower capacity per die area, so available sizes will be smaller).
In practice real CPUs just have split L1d/L1i caches, and unified L2 cache and memory, assuming it's ultimately a von Neumann type of architecture.
We call this "modified Harvard": the performance advantages of Harvard allowing parallel code-fetch and load/store, except for contention for access to the unified cache or memory. But it's rare to have lots of code cache misses at the same time as data misses, because if you're stalling on code fetch then you'll have bubbles in the pipeline anyway. (Out-of-order exec could hide that better than a single-cycle design, of course!)
It needs extra sync / pipeline flushing when we want to run machine code that we recently generated / stored, e.g. a JIT compiler, but other than that it has all the advantages of unified memory and the CPU-pipeline advantages of the Harvard split. (You need extra synchronization anyway to run recently-stored code on an ISA that allows deeply pipelined and out-of-order exec implementations, which fetch code far ahead into buffers in the pipeline to give more room to absorb bubbles.)
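In portable C that synchronization is exposed as the GCC/Clang builtin __builtin___clear_cache(), called between storing the machine code and jumping to it. A minimal sketch; the code bytes are x86-64 specific (mov eax, 42; ret), and on x86 the builtin is essentially a no-op because the hardware keeps the I-cache coherent, while on ARM and others it emits the required cache-maintenance instructions:

    /* Minimal JIT sketch: store machine code, then call the GCC/Clang builtin
     * __builtin___clear_cache() before executing it.  The code bytes are
     * x86-64 specific (mov eax, 42; ret). */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

        void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
            return 1;

        memcpy(mem, code, sizeof(code));

        /* Make the just-stored bytes visible to instruction fetch. */
        __builtin___clear_cache((char *)mem, (char *)mem + sizeof(code));

        int (*fn)(void) = (int (*)(void))mem;
        printf("%d\n", fn());   /* prints 42 */

        munmap(mem, 4096);
        return 0;
    }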
What does a 'split' cache mean, and how is it useful (if it is)?
L1 caches usually have split design, but L2, L3 caches have unified design, why?
The first pipelined CPUs had small caches or, in the case of MIPS R2000, even off-chip caches with only the controllers on-chip. But yes, MIPS R2000 had split I and D caches, because you don't want code-fetch to conflict with the MEM stage of load or store instructions; that would introduce a structural hazard that would interfere with running 1 instruction per cycle when you don't have cache misses.
In a single-cycle design I guess your cycle would normally be long enough to access memory twice because you aren't overlapping code-fetch and load/store, so you might not even need multi-ported memory?
L1 data caches are already multi-ported on modern high-performance CPUs, allowing them to commit a store from the store buffer in the same cycle as doing 1 or 2 loads on load execution units.
Having even more ports to also allow code-fetch from it would be even more expensive in terms of power, vs. two slightly smaller caches.
If you think of the Instruction Memory and Data Memory as caches, as in being backed by a unified main memory, then you have the traditional Modified Harvard Architecture, which has some of the advantages of both the Von Neumann and the Harvard Architecture together.
One point you didn't seem to raise is that separation of the two memories (caches) allows for simultaneous access, so an instruction can be fetched while data memory is read or written in the same cycle.  This would be more difficult with a unified cache/memory.  This advantage applies to single-cycle and pipelined processors, since in both designs there is overlap between instruction fetch (IF stage in pipelined) and memory operations (MEM stage in pipelined).
Further, as the Instruction Memory is read-only, it has less circuitry.  In the case of being caches, the IM has no dirty bits, no write-back, etc.  Further, the IM and DM can have different associativity.
In the case of not being caches, it is not clear how the computer system loads the instruction memory; perhaps it is some fast ROM, or it is loaded by an external device from ROM into IM.  A number of embedded systems have Instruction Tightly Integrated Memory (and/or Data Tightly Integrated Memory, ITIM/DTIM) that does not act as a cache and is not necessarily backed by main memory, instead serving as the primary memory.

Is Pinned memory non-atomic read/write safe on Xavier devices?

Double posted here; since I did not get a response, I will post here as well.
Cuda Version 10.2 (can upgrade if needed)
Device: Jetson Xavier NX/AGX
I have been trying to find the answer to this across this forum, stack overflow, etc.
So far what I have seen is that there is no need for an atomicRead in CUDA because:
“A properly aligned load of a 64-bit type cannot be “torn” or partially modified by an “intervening” write. I think this whole question is silly. All memory transactions are performed with respect to the L2 cache. The L2 cache serves up 32-byte cachelines only. There is no other transaction possible. A properly aligned 64-bit type will always fall into a single L2 cacheline, and the servicing of that cacheline cannot consist of some data prior to an extraneous write (that would have been modified by the extraneous write), and some data after the same extraneous write.” - Robert Crovella
However I have not found anything about cache flushing/loading for the iGPU on a tegra device. Is this also on “32-byte cachelines”?
My use case is to have one kernel writing to various parts of a chunk of memory (not atomically, i.e. not using atomic* functions), but also have a second kernel only reading those same bytes in a non-tearing manner. I am okay with slightly stale data in my read (given that the writing kernel flushes/updates the memory such that subsequent read kernels/processes get the update within a few milliseconds). The write kernel launches and completes in 4-8 ms or so.
At what point in the life cycle of the kernel does the iGPU update the DRAM with the cached values (given we are NOT using atomic writes)? Is it simply always at the end of the kernel execution, or at some other point?
Can/should pinned memory be used for this use case, or would unified be more appropriate such that I can take advantage of the cache safety within the iGPU?
According to the Memory Management section here we see that the iGPU access to pinned memory is Uncached. Does this mean we cannot trust the iGPU to still have safe access like Robert said above?
If using pinned, and a non-atomic write and read occur at the same time, what is the outcome? Is this undefined/segfault territory?
Additionally if using pinned and an atomic write and read occur at the same time, what is the outcome?
My goal is to remove the use of CPU-side mutexing around the memory being used by my various kernels, since this is causing a coupling/slow-down of two parts of my system.
Any advice is much appreciated. TIA.

Most generally correct way of updating a vertex buffer in Vulkan

Assume a vertex buffer in device memory and a staging buffer that's host coherent and visible. Also assume a desktop system with a discrete GPU (so separate memories). And lastly, assume correct inter-frame synchronization.
I see two general possible ways of updating a vertex buffer:
Map + memcpy + unmap into the staging buffer, followed by a transient (single command) command buffer that contains a vkCmdCopyBuffer, submit it to the graphics queue and wait for the queue to idle, then free the transient command buffer. After that submit the regular frame draw queue to the graphics queue as usual. This is the code used on https://vulkan-tutorial.com (for example, this .cpp file).
Similar to above, only instead use additional semaphores to signal after the staging buffer copy submit, and wait in the regular frame draw submit, thus skipping the "wait-for-idle" command.
#2 sort of makes sense to me, and I've repeatedly read not to do any "wait-for-idle" operations in Vulkan because it synchronizes the CPU with the GPU, but I've never seen it used in any tutorial or example online. What do the pros usually do if the vertex buffer has to be updated relatively often?
First, if you allocated coherent memory, then you almost certainly did so in order to access it from the CPU. Which requires mapping it. Vulkan is not OpenGL; there is no requirement that memory be unmapped before it can be used (and OpenGL doesn't even have that requirement anymore).
Unmapping memory should only ever be done when you are about to delete the memory allocation itself.
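Concretely, that means mapping the staging allocation once and keeping the pointer around. A minimal sketch using the Vulkan C API; device, stagingMemory and stagingSize are placeholders for objects created elsewhere:

    /* Persistent-mapping sketch: map host-visible, host-coherent staging
     * memory once, keep the pointer, and memcpy into it every frame. */
    #include <string.h>
    #include <vulkan/vulkan.h>

    static void *gStagingPtr;

    VkResult map_staging_once(VkDevice device, VkDeviceMemory stagingMemory,
                              VkDeviceSize stagingSize)
    {
        /* Map the whole allocation; with HOST_COHERENT memory no explicit
         * vkFlushMappedMemoryRanges is needed after CPU writes. */
        return vkMapMemory(device, stagingMemory, 0, stagingSize, 0, &gStagingPtr);
    }

    void upload_vertices(const void *vertices, size_t bytes)
    {
        /* Per-frame update: just copy.  Do NOT unmap until the allocation
         * itself is about to be destroyed. */
        memcpy(gStagingPtr, vertices, bytes);
    }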
Second, if you think of an idea that involves having the CPU wait for a queue or device to idle before proceeding, then you have come up with a bad idea and should use a different one. The only time you should wait for a device to idle is when you want to destroy the device.
Tutorial code should not be trusted to give best practices. It is often intended to be simple, to make it easy to understand a concept. Simple Vulkan code often gets in the way of performance (and if you don't care about performance, you shouldn't be using Vulkan).
In any case, there is no "most generally correct way" to do most things in Vulkan. There are lots of definitely incorrect ways, but no "generally do this" advice. Vulkan is a low-level, explicit API, and the result of that is that you need to apply Vulkan's tools to your specific circumstances. And maybe profile on different hardware.
For example, if you're generating completely new vertex data every frame, it may be better to see if the implementation can read vertex data directly from coherent memory, so that there's no need for a staging buffer at all. Yes, the reads may be slower, but the overall process may be faster than a transfer followed by a read.
Then again, it may not. It may be faster on some hardware, and slower on others. And some hardware may not allow you to use coherent memory for any buffer that has the vertex input usage at all. And even if it's allowed, you may be able to do other work during the transfer, and thus the GPU spends minimal time waiting before reading the transferred data. And some hardware has a small pool of device-local memory which you can directly write to from the CPU; this memory is meant for these kinds of streaming applications.
If you are going to do staging however, then your choices are primarily about which queue you submit the transfer operation on (assuming the hardware has multiple queues). And this primarily relates to how much latency you're willing to endure.
For example, if you're streaming data for a large terrain system, then it's probably OK if it takes a frame or two for the vertex data to be usable on the GPU. In that case, you should look for an alternative, transfer-only queue on which to perform the copy from the staging buffer to the primary memory. If you do, then you'll need to make sure that later commands which use the eventual results synchronize with that queue, which will need to be done via a semaphore.
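A hedged sketch of that transfer-queue pattern using the Vulkan C API; all handles here (queues, recorded command buffers, the copyDone semaphore, the fence) are assumed to have been created elsewhere, and a queue-family ownership transfer barrier may additionally be needed for VK_SHARING_MODE_EXCLUSIVE buffers (omitted for brevity):

    /* Submit the staging copy on a transfer queue and make the graphics
     * submit wait for it via a semaphore at the vertex-input stage. */
    #include <vulkan/vulkan.h>

    void submit_copy_then_draw(VkQueue transferQueue, VkQueue graphicsQueue,
                               VkCommandBuffer copyCmdBuf, VkCommandBuffer drawCmdBuf,
                               VkSemaphore copyDone, VkFence frameFence)
    {
        /* 1) Transfer-queue submit: signal copyDone when the copy finishes. */
        VkSubmitInfo copySubmit = { 0 };
        copySubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        copySubmit.commandBufferCount = 1;
        copySubmit.pCommandBuffers = &copyCmdBuf;
        copySubmit.signalSemaphoreCount = 1;
        copySubmit.pSignalSemaphores = &copyDone;
        vkQueueSubmit(transferQueue, 1, &copySubmit, VK_NULL_HANDLE);

        /* 2) Graphics-queue submit: wait on copyDone before vertex input
         *    reads the freshly copied buffer. */
        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
        VkSubmitInfo drawSubmit = { 0 };
        drawSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        drawSubmit.waitSemaphoreCount = 1;
        drawSubmit.pWaitSemaphores = &copyDone;
        drawSubmit.pWaitDstStageMask = &waitStage;
        drawSubmit.commandBufferCount = 1;
        drawSubmit.pCommandBuffers = &drawCmdBuf;
        vkQueueSubmit(graphicsQueue, 1, &drawSubmit, frameFence);
    }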
If you're in a low-latency scenario where the data being transferred needs to be used this frame, then it may be better to submit both to the same queue. You could use an event to synchronize them rather than a semaphore. But you should also endeavor to put some kind of unrelated work between the transfer and the rendering operation, so that you can take advantage of some degree of parallelism in operations.

In-memory function calls

What are in-memory function calls? Could someone please point me to some resource discussing this technique and its advantages? I need to learn more about them and at the moment do not know where to go. Google does not seem to help, as it takes me to the domain of cognition and the nervous system, etc.
Assuming your explanatory comment is correct (I'd have to see the original source of your question to know for sure..) it's probably a matter of either (a) function binding times or (b) demand paging.
Function Binding
When a program starts, the linker/loader finds all function references in the executable file that aren't resolvable within the file. It searches all the linked libraries to find the missing functions, and then iterates. At least the Linux ld.so(8) linker/loader supports two modes of operation: LD_BIND_NOW forces all symbol references to be resolved at program start up. This is excellent for finding errors and it means there's no penalty for the first use of a function vs repeated use of a function. It can drastically increase application load time. Without LD_BIND_NOW, functions are resolved as they are needed. This is great for small programs that link against huge libraries, as it'll only resolve the few functions needed, but for larger programs, this might require re-loading libraries from disk over and over, during the lifetime of the program, and that can drastically influence response time as the application is running.
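The same eager-vs-lazy trade-off is visible explicitly with dlopen(3), which takes RTLD_NOW or RTLD_LAZY. A minimal sketch; the library and symbol names are just examples:

    /* Eager vs. lazy symbol binding via dlopen(3): RTLD_NOW resolves every
     * symbol up front (like LD_BIND_NOW=1), RTLD_LAZY defers resolution
     * until each function is first called.  Build with: cc demo.c -ldl */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *handle = dlopen("libm.so.6", RTLD_NOW);   /* or RTLD_LAZY */
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
        if (cosine)
            printf("cos(0) = %f\n", cosine(0.0));

        dlclose(handle);
        return 0;
    }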
Demand Paging
Modern operating system kernels juggle more virtual memory than physical memory. Each application thinks it has access to an entire machine of 4 gigabytes of memory (for 32-bit applications) or much much more memory (for 64-bit applications), regardless of the actual amount of physical memory installed in the machine. Each page of memory needs a backing store, a drive space that will be used to store that page if the page must be shoved out of physical memory under memory pressure. If it is purely data, then it gets stored in a swap partition or swap file. If it is executable code, then it is simply dropped, because it can be reloaded from the file in the future if it needs to be. Note that this doesn't happen on a function-by-function basis -- instead, it happens on pages, which are a hardware-dependent feature. Think 4096 bytes on most 32-bit platforms, perhaps more or less on other architectures, and with special frameworks, upwards of 2 megabytes or 4 megabytes. If there is a reference to a missing page, the memory management unit will signal a page fault, and the kernel will load the missing page from disk and restart the process.
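On Linux you can watch this page-granular behaviour from user space with mincore(2), which reports which pages of a mapping are currently resident. A rough sketch; the file path is a placeholder:

    /* Observe demand paging: map a file and ask mincore(2) which of its
     * pages are currently resident in physical memory. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("some_file", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + page - 1) / page;
        unsigned char *vec = malloc(npages);

        /* vec[i] & 1 is set if page i is resident; touch the mapping and
         * call mincore() again to watch pages fault in on demand. */
        if (mincore(map, st.st_size, vec) == 0) {
            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;
            printf("%zu of %zu pages resident\n", resident, npages);
        }

        free(vec);
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }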

What are the advantages of memory-mapped files?

I've been researching memory mapped files for a project and would appreciate any thoughts from people who have either used them before, or decided against using them, and why?
In particular, I am concerned about the following, in order of importance:
concurrency
random access
performance
ease of use
portability
I think the advantage is really that you reduce the amount of data copying required over traditional methods of reading a file.
If your application can use the data "in place" in a memory-mapped file, it can come in without being copied; if you use a system call (e.g. Linux's pread()) then that typically involves the kernel copying the data from its own buffers into user space. This extra copying not only takes time, but decreases the effectiveness of the CPU's caches by accessing this extra copy of the data.
If the data actually has to be read from the disk (as in physical I/O), then the OS still has to read it in, and a page fault probably isn't any better performance-wise than a system call; but if it doesn't (i.e. the data is already in the OS cache), performance should in theory be much better.
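A small sketch of the two access styles being compared, reading the same (placeholder) file once with pread() into a user buffer and once in place through a mapping:

    /* The copy being avoided: pread() moves data from the kernel's page
     * cache into a user buffer, while mmap() lets you read the page cache
     * "in place".  "data.bin" is a placeholder filename. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Variant 1: explicit copy into a user-space buffer. */
        char buf[4096];
        ssize_t n = pread(fd, buf, sizeof(buf), 0);
        if (n > 0)
            printf("pread copy, first byte: %d\n", buf[0]);

        /* Variant 2: no copy; the mapping aliases the page cache. */
        const char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            printf("mmap in place, first byte: %d\n", map[0]);
            munmap((void *)map, st.st_size);
        }

        close(fd);
        return 0;
    }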
On the downside, there's no asynchronous interface to memory-mapped files - if you attempt to access a page which isn't mapped in, it generates a page fault then makes the thread wait for the I/O.
The obvious disadvantage to memory mapped files is on a 32-bit OS - you can easily run out of address space.
I have used a memory mapped file to implement an 'auto complete' feature while the user is typing. I have well over 1 million product part numbers stored in a single index file. The file has some typical header information but the bulk of the file is a giant array of fixed size records sorted on the key field.
At runtime the file is memory mapped, cast to a C-style struct array, and we do a binary search to find matching part numbers as the user types. Only a few memory pages of the file are actually read from disk -- whichever pages are hit during the binary search.
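A hedged sketch of that layout; the record struct, key size, and file name are made up for illustration, and the real index format (including its header) obviously differs:

    /* Sketch of the described approach: mmap the index file, treat the body
     * as a sorted array of fixed-size records, and bsearch() it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct part_record {
        char key[16];           /* part number, the sort key */
        char description[48];
    };

    static int cmp_prefix(const void *k, const void *r)
    {
        const char *key = k;
        const struct part_record *rec = r;
        return strncmp(key, rec->key, strlen(key));   /* prefix match */
    }

    int main(void)
    {
        int fd = open("parts.idx", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Only the pages that bsearch() actually touches get faulted in. */
        const struct part_record *records =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (records == MAP_FAILED) { perror("mmap"); return 1; }

        size_t count = st.st_size / sizeof(struct part_record);
        const struct part_record *hit =
            bsearch("ABC-123", records, count, sizeof(*records), cmp_prefix);

        if (hit)
            printf("found: %.48s\n", hit->description);
        else
            puts("no match");

        munmap((void *)records, st.st_size);
        close(fd);
        return 0;
    }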
Concurrency - I had an implementation problem where it would sometimes memory-map the file multiple times in the same process space. This was a problem because, as I recall, sometimes the system couldn't find a large enough free block of virtual memory to map the file to. The solution was to map the file only once and thunk all calls to it. In retrospect, using a full-blown Windows service would have been cool.
Random Access - The binary search is certainly random access and lightning fast.
Performance - The lookup is extremely fast. As users type a popup window displays a list of matching product part numbers, the list shrinks as they continue to type. There is no noticeable lag while typing.
Memory mapped files can be used to either replace read/write access, or to support concurrent sharing. When you use them for one mechanism, you get the other as well.
Rather than lseeking and writing and reading around in a file, you map it into memory and simply access the bits where you expect them to be.
This can be very handy, and depending on the virtual memory interface can improve performance. The performance improvement can occur because the operating system now gets to manage this former "file I/O" along with all your other programmatic memory access, and can (in theory) leverage the paging algorithms and so forth that it is already using to support virtual memory for the rest of your program. It does, however, depend on the quality of your underlying virtual memory system. Anecdotes I have heard say that the Solaris and *BSD virtual memory systems may show better performance improvements than the VM system of Linux--but I have no empirical data to back this up. YMMV.
Concurrency comes into the picture when you consider the possibility of multiple processes using the same "file" through mapped memory. In the read/write model, if two processes wrote to the same area of the file, you could be pretty much assured that one process's data would arrive in the file, overwriting the other process's data. You'd get one or the other, but not some weird intermingling. I have to admit I am not sure whether this is behavior mandated by any standard, but it is something you could pretty much rely on. (It's actually a good followup question!)
In the mapped world, in contrast, imagine two processes both "writing". They do so by doing "memory stores", which result in the O/S paging the data out to disk--eventually. But in the meantime, overlapping writes can be expected to occur.
Here's an example. Say I have two processes both writing 8 bytes at offset 1024. Process 1 is writing '11111111' and process 2 is writing '22222222'. If they use file I/O, then you can imagine, deep down in the O/S, there is a buffer full of 1s and a buffer full of 2s, both headed to the same place on disk. One of them is going to get there first, and the other one second. In this case, the second one wins. However, if I am using the memory-mapped file approach, process 1 is going to do a memory store of 4 bytes, followed by another memory store of 4 bytes (let's assume that's the maximum memory store size). Process 2 will be doing the same thing. Based on when the processes run, you can expect to see any of the following:
11111111
22222222
11112222
22221111
The solution to this is to use explicit mutual exclusion--which is probably a good idea in any event. You were sort of relying on the O/S to do "the right thing" in the read/write file I/O case, anyway.
The classic mutual exclusion primitive is the mutex. For memory-mapped files, I'd suggest you look at a memory-mapped mutex, available using (e.g.) pthread_mutex_init().
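A rough sketch of that idea: put a PTHREAD_PROCESS_SHARED mutex inside the mapped region itself so every process that maps the file can lock it (file name and struct layout are illustrative):

    /* A process-shared mutex living inside the memory-mapped file itself.
     * One process initializes it; every process that maps "shared.dat" can
     * then lock/unlock it.  Build with: cc demo.c -lpthread */
    #include <fcntl.h>
    #include <pthread.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct shared_region {
        pthread_mutex_t lock;
        char data[1024];
    };

    int main(void)
    {
        int fd = open("shared.dat", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, sizeof(struct shared_region)) != 0)
            return 1;

        struct shared_region *shm = mmap(NULL, sizeof(*shm),
                                         PROT_READ | PROT_WRITE,
                                         MAP_SHARED, fd, 0);
        if (shm == MAP_FAILED)
            return 1;

        /* Done once, by whichever process creates the file. */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&shm->lock, &attr);

        /* Every writer takes the lock before touching the shared bytes. */
        pthread_mutex_lock(&shm->lock);
        strcpy(shm->data, "11111111");
        pthread_mutex_unlock(&shm->lock);

        munmap(shm, sizeof(*shm));
        close(fd);
        return 0;
    }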
Edit with one gotcha: When you are using mapped files, there is a temptation to embed pointers to the data in the file, in the file itself (think linked list stored in the mapped file). You don't want to do that, as the file may be mapped at different absolute addresses at different times, or in different processes. Instead, use offsets within the mapped file.
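A tiny illustration of the offset idiom, using a made-up node layout and an in-memory buffer standing in for the mapping:

    /* Store offsets, not pointers, inside a mapped file: the base address
     * can differ between mappings/processes, but base + offset is always
     * valid. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct node {
        int32_t  value;
        uint64_t next_offset;   /* byte offset of the next node; 0 == end */
    };

    static const struct node *next_node(const void *base, const struct node *n)
    {
        return n->next_offset
            ? (const struct node *)((const char *)base + n->next_offset)
            : NULL;
    }

    int main(void)
    {
        /* Stand-in for a mapped file: two nodes linked by offset. */
        _Alignas(struct node) unsigned char file[64] = { 0 };
        struct node a = { 1, 32 };   /* the next node lives at offset 32 */
        struct node b = { 2, 0 };
        memcpy(file +  0, &a, sizeof(a));
        memcpy(file + 32, &b, sizeof(b));

        for (const struct node *n = (const struct node *)file; n;
             n = next_node(file, n))
            printf("value = %d\n", (int)n->value);

        return 0;
    }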
Concurrency would be an issue.
Random access is easier.
Performance is good to great.
Ease of use: not as good.
Portability: not so hot.
I've used them on a Sun system a long time ago, and those are my thoughts.