MPU subregion security for STM32H7 - ThreadX

I am trying to understand the STM32H7 MPU example.
In this example, only one region is created, covering the entire 4 GB address space.
The subregion option is activated, which means the region is divided into 8 subregions starting from 0x0.
SRD is set to 0x87, which means the MPU will be enabled only for:
on-chip peripheral address space,
external RAM,
shared device space.
This seems strange to me, because the most important address spaces are left unprotected: Flash, SRAM, System, and non-shareable devices.
Is there any explanation of why the subregions were configured that way?

When a ThreadX Module thread is scheduled, the MPU is reconfigured such that the module can only access its code and data memory.

One background region is created during initialization. This region is the only active one for privileged code. Module-specific regions are configured every time there is a task switch into user code. More information here:
https://developer.arm.com/documentation/dui0646/c/Cortex-M7-Peripherals/Optional-Memory-Protection-Unit?lang=en
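For reference, a single 4 GB background region with SRD = 0x87 could be set up roughly like this. This is a minimal sketch using CMSIS-style register names; the access/attribute bits here are illustrative and not necessarily the exact values used in the ST/ThreadX example.

```c
/* Sketch: one 4 GB "background" region with SRD = 0x87.
 * SRD bits 0, 1, 2 and 7 are set, so those subregions are disabled and the
 * region's attributes apply only to subregions 3-6 (each subregion is 512 MB). */
#include "stm32h7xx.h"   /* assumed ST device header pulling in the CMSIS MPU registers */

void mpu_background_region_init(void)
{
    MPU->CTRL = 0;                            /* disable the MPU while reconfiguring */

    MPU->RNR  = 0;                            /* select region number 0 */
    MPU->RBAR = 0x00000000;                   /* base address 0x0 */
    MPU->RASR = (0x87u << MPU_RASR_SRD_Pos)   /* disable subregions 0, 1, 2, 7 */
              | (31u   << MPU_RASR_SIZE_Pos)  /* size = 2^(31+1) = 4 GB */
              | MPU_RASR_XN_Msk               /* example attribute: execute never */
              | (0x3u  << MPU_RASR_AP_Pos)    /* example attribute: full access */
              | MPU_RASR_ENABLE_Msk;

    /* PRIVDEFENA lets privileged code fall back to the default memory map for
     * addresses not covered by any enabled (sub)region. */
    MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk | MPU_CTRL_ENABLE_Msk;
    __DSB();
    __ISB();
}
```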

Related

Does the position of a function/method in a program matter in terms of increasing/decreasing speed at a lower level (memory)?

Let's say I write a program which contains many functions/methods. Some of these functions are used many more times than others.
In this case, does the positioning of a function/method matter in terms of altering speed at a lower level (memory)?
I am currently learning Computer Organization & Architecture, so this doubt arose in my mind.
RAM itself is "flat", with equal performance at any address (except for NUMA local vs. remote memory in a multi-socket machine, or mixed-size DIMMs on a single socket leading to only partial dual-channel benefits; see footnote 1).
i-cache and iTLB locality can make a difference, so grouping "hot" functions together can be useful even if you don't just inline them.
Locality also matters for demand paging of code in from disk: if a whole block of your executable is "cold", e.g. only needed for error handling, program startup doesn't have to wait for it to be page-faulted in from disk (or even take soft page faults if it was hot in the OS's page cache). Similarly, grouping "startup" code into a page can allow the OS to drop that "clean" page later when it's no longer needed, freeing up physical memory for more caching.
Compilers like GCC do this, putting CRT startup code like _start (which eventually calls main) into a .init section in the same program segment (mapped by the program loader) as .text and .fini, just to group startup code together. Any C++ non-const static-initializer functions would also go in that section.
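As a small sketch (not from the original question): GCC and Clang accept hot/cold function attributes that hint where functions should be grouped; cold functions typically end up in .text.unlikely, away from the hot path, which helps i-cache and paging locality.

```c
/* Illustrative only: attributes that influence where code is placed. */
#include <stddef.h>
#include <stdio.h>

__attribute__((cold)) static void report_error(const char *msg)
{
    /* rarely executed: grouped away from the hot code */
    fprintf(stderr, "error: %s\n", msg);
}

__attribute__((hot)) static long sum(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)   /* hot loop: benefits from i-cache locality */
        s += a[i];
    return s;
}

int main(void)
{
    int data[4] = {1, 2, 3, 4};
    long s = sum(data, 4);
    if (s != 10)
        report_error("unexpected sum");
    printf("%ld\n", s);
    return 0;
}
```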
Footnote 1: Usually; IIRC it's possible for a computer with one 4 GB and one 8 GB stick of memory to run dual-channel for the first 8 GB of physical address space, but only single-channel for the last 4 GB, so half the memory bandwidth. I think some real-life Intel chipset / CPU memory controllers are like that.
But unless you are making an embedded system, you don't choose where in physical memory the OS loads your program. It's also much more normal for computers to use matched memory on multi-channel memory controllers, so the whole range of memory can be interleaved between channels.
BTW, locality matters for DRAM itself: it's laid out in a row/column setup, and switching rows takes an extra DDR controller command vs. just reading another column in the same open "page". DRAM pages aren't the same thing as virtual-memory pages; a DRAM page is memory in the same row on the same channel, and is often 2 KiB. See What Every Programmer Should Know About Memory? for more details than you'll probably ever want about DDR DRAM, and some really good stuff about cache and memory layout.

What is the difference between Nvidia Hyper Q and Nvidia Streams?

I always thought that Hyper-Q technology is nothing but streams in the GPU. Later I found I was wrong (am I?). So I was doing some reading about Hyper-Q and got more confused.
I was going through one article and it had these two statements:
A. Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process
B. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi)
In the aforementioned points, point B says that there can be multiple connections created to a single GPU from the host. Does it mean I can create multiple contexts on a single GPU through different applications? Does it mean that I will have to execute all applications on different streams? What if all my connections are memory- and compute-resource consuming; who manages the resource (memory/cores) scheduling?
Think of HyperQ as streams implemented in hardware on the device side.
Before the arrival of HyperQ, e.g. on Fermi, commands (kernel launches, memory transfers, etc.) from all streams were placed in a single work queue by the driver on the host. That meant that commands could not overtake each other, and you had to be careful issuing them in the right order on the host to achieve best overlap.
On the GK110 GPU and later devices with HyperQ, there are (at least) 32 work queues on the device. This means that commands from different queues can be reordered relative to each other until they start execution. So both orderings in the example linked above lead to good overlap on a GK110 device.
This is particularly important for multithreaded host code, where you can't control the order without additional synchronization between threads.
Note that of the 32 hardware queues, only 8 are used by default to save resources. Set the CUDA_DEVICE_MAX_CONNECTIONS environment variable to a higher value if you need more.
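As a rough host-side sketch (plain C against the CUDA runtime API; kernels, real work, and error checking are omitted), this is how you would create multiple streams and raise the connection count. Note the environment variable has to be set before the CUDA context is created.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Must be set before the first CUDA call that creates the context
     * (the default number of hardware connections is 8). */
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);

    enum { NSTREAMS = 4 };
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamCreate(&streams[i]);

    /* Work issued into different streams here (kernel launches,
     * cudaMemcpyAsync, ...) can be picked up by separate hardware
     * work queues on a Hyper-Q capable device. */

    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```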

Native Client inner/outer sandbox

I am dealing with the Chrome Native Client and have some difficulties in the following points:
As I understand it so far, the first 64 KB of the 256 MB NaCl segment are dedicated to the inner sandbox. This inner sandbox contains the trampoline and the springboard, which communicate from the trusted code to the untrusted code and vice versa. When I am in these first 64 KB, can I jump to the middle of 32-byte instructions? For example, if I have a 32-byte instruction in the trampoline, can I jump from this instruction to the middle (not 32-byte aligned) of another 32-byte instruction in the trampoline? Are all the instructions in the trampoline and the springboard also 32-byte aligned?
Can I combine several x86 instructions into one 32-byte-aligned NaCl instruction (for example, putting AND 0xffffffe0, %eax and JMP *%eax in one 32-byte-aligned NaCl instruction)?
I understood that the service runtime deals with process creation, memory management, etc., and that it is accessed through the trampoline. How exactly does the trampoline instruction access the service runtime? Where is the service runtime located in memory? When the service runtime finishes, can it access a non-32-byte-aligned instruction in the springboard?
What is the actual duty of the outer sandbox? How does it monitor and filter the system calls? If there is a bug in the validator of the inner sandbox, in what cases can it catch illegal/malicious instructions?
Thank you all
I'm not 100% sure off the top of my head, but I would guess from looking just at the directory layout of the source that they are both part of the trusted service runtime code (they are in the src/trusted/service_runtime directory), and are therefore built with the system compiler and not subject to validation.
Yes, there is no limit on the number of instructions in a 32-byte bundle. The restriction is just that no instruction (or multi-instruction sandboxing sequence such as the one you mentioned for indirect jumps) may cross the bundle boundary. So in your example, both of those instructions would be required to be in the same bundle.
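If it helps, here is a small, purely illustrative C program (not NaCl code) showing what the AND mask in that indirect-jump sequence does to a target address: it rounds it down to a 32-byte bundle boundary, which is why control can only enter a validated bundle at its start.

```c
#include <stdint.h>
#include <stdio.h>

#define BUNDLE_SIZE 32u

/* Equivalent of "and $0xffffffe0, %eax" applied before "jmp *%eax". */
static uint32_t mask_to_bundle(uint32_t target)
{
    return target & ~(BUNDLE_SIZE - 1);
}

int main(void)
{
    uint32_t t = 0x00401013;   /* hypothetical, unaligned jump target */
    printf("raw target   : 0x%08x\n", t);
    printf("masked target: 0x%08x\n", mask_to_bundle(t));  /* 0x00401000 */
    return 0;
}
```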
Again I'm a bit fuzzy on the details of how the trampolines work but when control transfers from the trampoline, it ends up in the service runtime, which is just ordinary machine code built according to the native ABIs for the OS. So the service runtime can use any system calls (at least any allowed by the outer sandbox) and can read or execute any part of the untrusted code.
The outer sandbox is, strictly speaking, a defense in depth (i.e. the inner sandbox is in theory sufficient to contain the untrusted code). It filters system calls in different ways on different OSes. In Chrome's embedding of NaCl, the outer sandbox is the same implementation as the Chrome sandbox used for the renderer and GPU processes.

When is memory scratch space 15 used in BPF (Berkeley Packet Filter) or tcpdump?

My question is regarding the tcpdump command.
The command "tcpdump -i eth1 -d" lists the assembly instructions involved in the filter.
I am curious to see that no instruction accesses M[15] (memory slot 15).
Can someone let me know whether there are any filters for which this memory slot is used?
What is it reserved for, and how is it used?
Memory slots aren't assigned to specific purposes; they're allocated dynamically by pcap_compile() as needed.
For most filters on most network types, pcap_compile()'s optimizer will remove all memory slot uses, or, at least, reduce them so that the code doesn't need 16 memory slots.
For 802.11 (native 802.11 that you see in monitor mode, not the "fake Ethernet" you get when not in monitor mode), the optimizer currently isn't used (it's designed around assumptions that don't apply to the more complicated decision making required to handle 802.11, and fixing it is a big project), so you'll see more use of memory locations. However, you'll probably need a very complicated filter to use M[15] - or M[14] or M[13] or most of the lower-address memory locations.
(You can also run tcpdump with the -O option to disable the optimizer.)
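If you want to experiment, here is a small sketch using only documented libpcap calls (pcap_open_dead, pcap_compile, bpf_dump) that reproduces what tcpdump -d and -O do, so you can compare the optimized and unoptimized code for your own filters. The filter expression is just an example.

```c
#include <stdio.h>
#include <pcap.h>

static void dump_filter(const char *expr, int optimize)
{
    /* Fake Ethernet capture handle, good enough for compiling a filter. */
    pcap_t *p = pcap_open_dead(DLT_EN10MB, 65535);
    struct bpf_program prog;

    if (pcap_compile(p, &prog, expr, optimize, PCAP_NETMASK_UNKNOWN) == 0) {
        printf("-- %s (optimize=%d) --\n", expr, optimize);
        bpf_dump(&prog, 1);          /* prints the instructions, like -d */
        pcap_freecode(&prog);
    } else {
        fprintf(stderr, "compile failed: %s\n", pcap_geterr(p));
    }
    pcap_close(p);
}

int main(void)
{
    const char *expr = "tcp port 80 or udp port 53";
    dump_filter(expr, 1);  /* optimizer on: scratch-memory uses usually removed */
    dump_filter(expr, 0);  /* optimizer off (like -O): more M[] slots may appear */
    return 0;
}
```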

In-memory function calls

What are in-memory function calls? Could someone please point me to some resource discussing this technique and its advantages? I need to learn more about them and at the moment do not know where to go. Google does not seem to help, as it takes me to the domain of cognition and the nervous system, etc.
Assuming your explanatory comment is correct (I'd have to see the original source of your question to know for sure), it's probably a matter of either (a) function binding times or (b) demand paging.
Function Binding
When a program starts, the linker/loader finds all function references in the executable file that aren't resolvable within the file. It searches all the linked libraries to find the missing functions, and then iterates. At least the Linux ld.so(8) linker/loader supports two modes of operation: LD_BIND_NOW forces all symbol references to be resolved at program startup. This is excellent for finding errors, and it means there's no penalty for the first use of a function vs. repeated use; however, it can drastically increase application load time. Without LD_BIND_NOW, functions are resolved as they are needed. This is great for small programs that link against huge libraries, as only the few functions actually needed get resolved, but for larger programs it might require re-loading libraries from disk over and over during the lifetime of the program, and that can drastically affect response time while the application is running.
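A small, hedged illustration of the two binding modes using dlopen() directly: RTLD_NOW resolves every symbol up front (much like running with LD_BIND_NOW=1), while RTLD_LAZY defers resolution until first use. The library name assumes a glibc system; link with -ldl.

```c
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    /* Resolve everything at load time -- errors surface immediately. */
    void *eager = dlopen("libm.so.6", RTLD_NOW);
    /* Resolve symbols only when they are first used. */
    void *lazy  = dlopen("libm.so.6", RTLD_LAZY);

    if (!eager || !lazy) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    double (*cosine)(double) = (double (*)(double))dlsym(lazy, "cos");
    if (cosine)
        printf("cos(0.0) = %f\n", cosine(0.0));

    dlclose(eager);
    dlclose(lazy);
    return 0;
}
```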
Demand Paging
Modern operating system kernels juggle more virtual memory than physical memory. Each application thinks it has access to an entire machine's worth of memory: 4 gigabytes for 32-bit applications, or much, much more for 64-bit applications, regardless of the actual amount of physical memory installed in the machine. Each page of memory needs a backing store, drive space that will be used to store that page if it must be shoved out of physical memory under memory pressure. If it is purely data, then it gets stored in a swap partition or swap file. If it is executable code, then it is simply dropped, because it can be reloaded from the file in the future if needed. Note that this doesn't happen on a function-by-function basis -- instead, it happens on pages, which are a hardware-dependent feature. Think 4096 bytes on most 32-bit platforms, perhaps more or less on other architectures, and, with special frameworks, upwards of 2 megabytes or 4 megabytes. If there is a reference to a missing page, the memory management unit will signal a page fault, and the kernel will load the missing page from disk and restart the process.
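If you want to see demand paging in action, here is a hedged, Linux-specific sketch that maps a file and uses mincore() to report how many of its pages are currently resident in physical memory; pages that were never touched are typically only faulted in on first access. The file path is just an example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/bin/ls";   /* any readable file */
    int fd = open(path, O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror(path); return 1; }

    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;
    unsigned char *vec = malloc(npages);

    volatile unsigned char touch = map[0];   /* fault in the first page */
    (void)touch;

    if (mincore(map, st.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;          /* low bit set = page is resident */
        printf("%zu of %zu pages resident\n", resident, npages);
    }

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```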