When is memory scratch space 15 used in BPF (Berkeley Packet Filter) or tcpdump?

My question is regarding the tcpdump command.
The command "tcpdump -i eth1 -d" lists the assembly instructions involved in the filter.
I am curious to see that no instruction accesses M[15] (memory slot 15).
Can someone let me know whether there are any filters for which this memory slot is used?
What is it reserved for, and how is it used?

Memory slots aren't assigned to specific purposes; they're allocated dynamically by pcap_compile() as needed.
For most filters on most network types, pcap_compile()'s optimizer will remove all memory slot uses, or, at least, reduce them so that the code doesn't need 16 memory slots.
For 802.11 (native 802.11 that you see in monitor mode, not the "fake Ethernet" you get when not in monitor mode), the optimizer currently isn't used (it's designed around assumptions that don't apply to the more complicated decision making required to handle 802.11, and fixing it is a big project), so you'll see more use of memory locations. However, you'll probably need a very complicated filter to use M[15] - or M[14] or M[13] or most of the lower-address memory locations.
(You can also run tcpdump with the -O option to disable the optimizer.)
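Not part of the original answer, but if you want to poke at this programmatically, here is a minimal sketch using libpcap directly (the same library tcpdump uses) to compile a filter and dump the generated BPF program; the filter string is just an example, and you can flip the optimize argument to 1 to see what the optimizer removes.

/* Sketch: compile a filter with libpcap and dump the generated BPF program
 * so you can see which scratch-memory slots M[0..15] actually get used.
 * Build with: cc dumpfilter.c -lpcap
 */
#include <pcap/pcap.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *filter = (argc > 1) ? argv[1] : "tcp port 80";
    struct bpf_program prog;

    /* A dead pcap handle is enough for compiling filters offline;
     * DLT_EN10MB = Ethernet, 65535 = snaplen. */
    pcap_t *p = pcap_open_dead(DLT_EN10MB, 65535);
    if (p == NULL) {
        fprintf(stderr, "pcap_open_dead failed\n");
        return 1;
    }

    /* Fourth argument: 1 = run the optimizer (like plain tcpdump),
     * 0 = leave it off (like tcpdump -O). */
    if (pcap_compile(p, &prog, filter, 0 /* no optimizer */,
                     PCAP_NETMASK_UNKNOWN) == -1) {
        fprintf(stderr, "pcap_compile: %s\n", pcap_geterr(p));
        return 1;
    }

    /* Same output format as tcpdump -d. Look for "ld M[n]" / "st M[n]". */
    bpf_dump(&prog, 1);

    pcap_freecode(&prog);
    pcap_close(p);
    return 0;
}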

Related

Is it possible to force cudaMallocManaged to allocate on a specific GPU id (e.g. via cudaSetDevice)

I want to use cudaMallocManaged, but is it possible to force it to allocate memory on a specific GPU id (e.g. via cudaSetDevice) on a multi-GPU system?
The reason is that I need to allocate several arrays on the GPU, and I know which sets of these arrays need to work together, so I want to manually make sure they are on the same GPU.
I searched the CUDA documentation, but didn't find any info related to this. Can someone help? Thanks!
No you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.
If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.
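Here is a minimal sketch of the prefetch approach described above, assuming a machine with at least two GPUs; the device ID, sizes, and error-check macro are illustrative, not part of the original answer.

// Sketch of prefetching managed allocations to a chosen GPU before use.
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main()
{
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a = nullptr, *b = nullptr;

    // Managed allocations: no physical placement is decided yet
    // (on compute capability 6.x and newer; see the doc quote below).
    CHECK(cudaMallocManaged(&a, bytes));
    CHECK(cudaMallocManaged(&b, bytes));

    // Migrate both arrays to GPU 1 before the kernels that use them run,
    // so they end up resident on the same device.
    int target = 1;
    CHECK(cudaMemPrefetchAsync(a, bytes, target, 0 /* stream */));
    CHECK(cudaMemPrefetchAsync(b, bytes, target, 0));
    CHECK(cudaSetDevice(target));
    CHECK(cudaDeviceSynchronize());

    // ... launch kernels on device 1 that work on a and b ...

    CHECK(cudaFree(a));
    CHECK(cudaFree(b));
    return 0;
}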
From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
So for any recent architecture it works like NUMA nodes on the CPU: Allocation says nothing about where the memory will be physically allocated. This instead is decided on "first touch", i.e. initialization. So as long as the first write to these locations comes from the GPU where you want it to be resident, you are fine.
Therefore I also don't think a feature request will find support. In this memory model, allocation and placement are just completely independent operations.
In addition to explicit prefetching as Robert Crovella described it, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (See N.3.2. Data Usage Hints).
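To make that concrete, here is a hedged sketch combining the data-usage hints with the "first touch" behavior described above; the device ID and the init kernel are illustrative, not taken from the question.

// Sketch: advise placement on a chosen GPU, then populate pages by first touch.
#include <cuda_runtime.h>

__global__ void init(float *p, size_t n, float v)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;   // first touch happens on the GPU running this kernel
}

int main()
{
    const size_t n = 1 << 20;
    float *a = nullptr;
    cudaMallocManaged(&a, n * sizeof(float));

    int dev = 1;  // the GPU we want the pages to live on
    // Hint: prefer physical placement on device 'dev', and note that the
    // host will also access the data.
    cudaMemAdvise(a, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemAdvise(a, n * sizeof(float), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    // First touch from device 'dev' populates the pages there.
    cudaSetDevice(dev);
    init<<<(n + 255) / 256, 256>>>(a, n, 0.0f);
    cudaDeviceSynchronize();

    cudaFree(a);
    return 0;
}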
The idea behind all this is that you can start off by just using cudaMallocManaged and not caring about placement, etc. during fast prototyping. Later you profile your code and then optimize the parts that are slow using hints and prefetching to get (almost) the same performance as with explicit memory management and copies. The final code may not be that much easier to read / less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance instead of a buggy application with e.g. corrupted data that might be overlooked.
In Multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.

Does the position of a function/method in a program matter in terms of increasing/decreasing speed at a lower level (memory)?

Let's say I write a program which contains many functions/methods. In this program, some functions are used many times compared to others.
In this case, does the positioning of a function/method matter in terms of altering speed at a lower level (memory)?
I am currently learning Computer Organization & Architecture, so this doubt came to mind.
RAM itself is "flat", with equal performance at any address (except for NUMA local vs. remote memory in a multi-socket machine, or mixed-size DIMMs on a single socket leading to only partial dual-channel benefits; see footnote 1).
i-cache and iTLB locality can make a difference, so grouping "hot" functions together can be useful even if you don't just inline them.
Locality also matters for demand paging of code in from disk: If a whole block of your executable is "cold", e.g. only needed for error handling, program startup doesn't have to wait for it to get page-faulted in from disk (or even soft page faults if it was hot in the OS's pagecache). Similarly, grouping "startup" code into a page can allow the OS to drop that "clean" page later when it's no longer needed, freeing up physical memory for more caching.
Compilers like GCC do this, putting CRT startup code like _start (which eventually calls main) into a .init section in the same program segment (mapped by the program loader) as .text and .fini, just to group startup code together. Any C++ non-const static-initializer functions would also go in that section.
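As an illustrative sketch of the grouping idea (the function names are made up, and this relies on GCC/Clang-specific attributes): marking functions hot or cold lets the toolchain place cold code in a separate .text.unlikely section, away from the hot path, which helps i-cache/iTLB locality and demand paging.

/* Build with: cc -O2 hotcold.c ; inspect placement with: nm -S a.out or objdump -d */
#include <stdio.h>

__attribute__((cold, noinline))
static void report_fatal_error(const char *msg)
{
    /* Error handling: rarely executed, fine to keep out of the hot pages. */
    fprintf(stderr, "fatal: %s\n", msg);
}

__attribute__((hot))
static long sum(const long *v, long n)
{
    /* Hot inner loop: benefits from staying packed with other hot code. */
    long s = 0;
    for (long i = 0; i < n; i++)
        s += v[i];
    return s;
}

int main(void)
{
    long v[4] = {1, 2, 3, 4};
    if (sum(v, 4) != 10)
        report_fatal_error("unexpected sum");
    return 0;
}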
Footnote 1: Usually; IIRC it's possible for a computer with one 4G and one 8G stick of memory to run dual channel for the first 8GB of physical address space, but only single channel for the last 4, so half the memory bandwidth. I think some real-life Intel chipset / CPU memory controllers are like that.
But unless you were making an embedded system, you don't choose where in physical memory the OS loads your program. It's also much more normal for computers to use matched memory on multi-channel memory controllers so the whole range of memory can be interleaved between channels.
BTW, locality matters for DRAM itself: it's laid out in a row/column setup, and switching rows takes an extra DDR controller command vs. just reading another column in the same open "page". DRAM pages aren't the same thing as virtual-memory pages; a DRAM page is memory in the same row on the same channel, and is often 2kiB. See What Every Programmer Should Know About Memory? for more details than you'll probably ever want about DDR DRAM, and some really good stuff about cache and memory layout.
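If you want to see access-pattern effects for yourself, here is a rough sketch (not from the answer; the buffer size and the 2 KiB stride are illustrative): walk a buffer much larger than the caches either sequentially or with a large stride. The sequential walk mostly hits already-open DRAM rows and prefetches well; the strided walk keeps forcing row switches. The measured gap also includes cache and prefetch effects, so treat this as a demonstration, not a precise DRAM benchmark.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (256u * 1024u * 1024u)   /* 256 MiB, far bigger than the LLC */
#define STRIDE 2048u               /* roughly one DRAM page apart      */

static double walk(volatile unsigned char *buf, size_t stride)
{
    struct timespec t0, t1;
    unsigned long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* Touch every byte exactly once, in either sequential or strided order. */
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < N; i += stride)
            sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    unsigned char *buf = malloc(N);
    if (!buf) return 1;
    for (size_t i = 0; i < N; i++) buf[i] = (unsigned char)i;  /* fault pages in */

    printf("sequential: %.3f s\n", walk(buf, 1));
    printf("strided:    %.3f s\n", walk(buf, STRIDE));
    free(buf);
    return 0;
}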

Can I fix my GPU clock rate to ensure consistent profiling results?

I want to do some comparative profiling of a couple of CUDA kernels. However, one of them runs within a program which loads the GPU with more work, while the other is only running in a test harness.
For some GPUs, these circumstances mean the clock rates change (perhaps more than one kind of clock rate, because there are several). This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
Is it possible to prevent clock rates from changing due to load (or thermal conditions)?
I've looked into doing this with the nvidia-smi utility, which has a sub-command named clocks - but all that does is the following:
clocks -- Control and query clock information.
Usage: nvidia-smi clocks [options]
options include:
[-i | --id]: Enumeration index, PCI bus ID or UUID. Provide comma
separated values for more than one device
[ | --sync-boost-list]: List all synchronous boost groups
[ | --sync-boost-add]: Add a synchronous boost group
[ | --sync-boost-remove]: Remove a synchronous boost group. Provide the group id
returned from --sync-boost-list
... and it doesn't look like that's what I need. Of course, non-nvidia-smi-based solutions are welcome.
Notes:
I'm particularly interested in fixing clock rates for Quadro and Tesla cards, in case that matters.
I can be root if necessary.
Using CUDA 10.2 with its bundled driver. If absolutely necessary, I might be able to switch to a new version.
TL;DR
first, set persistence mode e.g. nvidia-smi -i 0 -pm 1 (sets persistence mode for the GPU index 0)
use a nvidia-smi command like -ac or -lgc (application clocks, lock gpu clock)
there is nvidia-smi command-line help for all of this: nvidia-smi --help
this functionality may not work on your GPU. Install the latest driver; also, some of this functionality is simply not available on certain products
these settings often require root privilege, or admin privilege on Windows
all of this description is subject to change. With some care, the command-line help for the version you are using should be instructive
LONGER:
I'm using driver 455.23.05 for this description. Some features (e.g. -lgc) may not be available in older drivers. Setting persistence mode may be necessary for some of these features, and will also help to reduce variability on application start-up. This is not intended to be an exhaustive description of the nvidia-smi tool.
SETTING APPLICATION CLOCKS:
The application clocks feature should generally be useful for the testing described. It will not force the GPU clocks to remain at the specified setting when there is no application running (AFAIK), but the clocks should attain those values "as soon as" the application starts running. It allows you to specify both gpu clock (i.e. core clock) as well as memory clock. Let's start by excerpting the command line help text for some of the important switches:
-ac --applications-clocks= Specifies <memory,graphics> clocks as a
pair (e.g. 2000,800) that defines GPU's
speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
Toggles permission requirements for -ac and -rac commands:
0/UNRESTRICTED, 1/RESTRICTED
To get started setting application clocks, you may need to use sudo or similar on Linux for some or all of these commands. Also note above that the requirement for elevated privilege can be turned on/off. Also important is that you cannot pick any values you like for the <memory,graphics> settings pair. You must specify a pair, and furthermore the pair can only come from a list of permissible options. Other choices will result in unspecified behavior. These choices can be determined from the --query-supported-clocks switch (use --help-query-supported-clocks to get command-line help on that switch) to nvidia-smi, which itself requires some formatting. For example, the following command will give an exhaustive list of the valid pairs that can be passed to the -ac command:
nvidia-smi -i 0 --query-supported-clocks=mem,gr --format=csv
Once you have that list of valid pairs, you can specify one of those pairs to the application clocks command:
nvidia-smi -i 0 -ac 877,1215
(The above command, if run with root or enabled via -acp, would set the memory clock to 877MHz and the core clock to 1215MHz on my Tesla V100, for example. Note the -i switch to select the GPU to target with this command. The 877,1215 pair may not be valid on your GPU. Also note that the -acp feature is removed from drivers 465.xx and newer.)
When you are done with whatever you are doing, you may wish to reset the application clock behavior to the default behavior (GPU selects clock freqs according to its own heuristics) using -rac.
Also, a number of the pairs offered may involve "boosting" behavior. The GPU is not guaranteed to maintain all clocks exactly as you specify, if a throttling event occurs. Typical throttling events are:
GPU is consuming too much electrical power
GPU temperature is too high
The existence of an actual throttling event can be discovered using the "full" output from nvidia-smi (nvidia-smi -a); look for "clocks throttle reasons". Other useful information is available in this output, such as the default application clocks. When N/A appears in your output, it means that your GPU does not support this feature. There is a great variety of supported features across various GPU families; I won't be able to respond to questions about this.
In the absence of a throttling event, and assuming your GPU supports the feature, I would expect application clocks to remain in effect throughout your application runtime. Note that if this command is specified while an application is currently running, the change in clocks may not take effect until the GPU becomes idle. You may wish to monitor GPU clocks in this case (again, using nvidia-smi). Therefore I would generally recommend using these commands when the GPU is idle. Then begin your work on the GPU after that.
LOCK GPU (CORE) CLOCK:
In many cases, the gpu core clock (core, gpu, graphics are all synonyms in this context) exhibits the most variability (for example the application clocks offered on my Tesla V100 only include a value of 877MHz for memory clock; no other choices are possible). There is a separate switch that can be used to "lock" the GPU core clock to a range of values.
-lgc --lock-gpu-clocks= Specifies <minGpuClock,maxGpuClock> clocks as a
pair (e.g. 1500,1500) that defines the range
of desired locked GPU clock speed in MHz.
Setting this will supercede application clocks
and take effect regardless if an app is running.
Input can also be a singular desired clock value
(e.g. <GpuClockValue>).
-rgc --reset-gpu-clocks
Resets the Gpu clocks to the default values.
This range is specified using a lower and upper endpoint for the range. If you wish to select a specific value only, you can specify the lower and upper endpoints both to be that value. As far as I know the range endpoints are inclusive.
For example, the following command:
nvidia-smi -i 0 -lgc 1215,1215
will "lock" the GPU core clock to 1215 MHz on my Tesla V100 GPU. As far as I know, this effect takes place immediately, even if an application is running. Most other caveats I can think of should be similar for application clocks:
choose a valid GPU core clock, as output from the --query-supported-clocks command
GPU is not guaranteed to maintain the request in the event of throttling
elevated privilege is required
reset the behavior with -rgc
As indicated in the help, this switch "overrides" previous application clocks settings with respect to core clock. Also, note that many switches come in 2 flavors, a "long" form and a "short" form. Where additional switch parameters are required, the long form often requires an = separator, the short form often requires a space separator:
nvidia-smi -i 0 -lgc 1215,1215
or
nvidia-smi -i 0 --lock-gpu-clocks=1215,1215
you generally cannot intermix this formatting:
nvidia-smi -i 0 -lgc=1215,1215
will probably report an error.
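If you would rather set this from inside a program or test harness, here is a sketch using the NVML C API that nvidia-smi is built on (link with -lnvidia-ml). It assumes a driver new enough to expose the locked-clocks calls; the clock values are the ones from the nvidia-smi examples above and will not be valid on every GPU.

#include <nvml.h>
#include <stdio.h>

#define NVML_CHECK(call)                                              \
    do {                                                              \
        nvmlReturn_t r_ = (call);                                     \
        if (r_ != NVML_SUCCESS) {                                     \
            fprintf(stderr, "NVML error: %s\n", nvmlErrorString(r_)); \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main(void)
{
    nvmlDevice_t dev;

    NVML_CHECK(nvmlInit_v2());
    NVML_CHECK(nvmlDeviceGetHandleByIndex_v2(0, &dev));   /* GPU index 0 */

    /* Equivalent of: nvidia-smi -i 0 -ac 877,1215  (mem MHz, graphics MHz) */
    NVML_CHECK(nvmlDeviceSetApplicationsClocks(dev, 877, 1215));

    /* Equivalent of: nvidia-smi -i 0 -lgc 1215,1215  (min MHz, max MHz) */
    NVML_CHECK(nvmlDeviceSetGpuLockedClocks(dev, 1215, 1215));

    /* ... run the kernels you want to profile ... */

    /* Equivalent of -rgc and -rac: restore default clock management. */
    NVML_CHECK(nvmlDeviceResetGpuLockedClocks(dev));
    NVML_CHECK(nvmlDeviceResetApplicationsClocks(dev));

    NVML_CHECK(nvmlShutdown());
    return 0;
}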
A FINAL NOTE:
This effect is particularly severe in devices like Tesla T4's (which aren't actively cooled).
In my experience with T4, a possible observation is throttling. The T4 GPU is one of the lowest-power datacenter-grade GPUs, and it's certainly possible for the GPU compute demands to exceed what the power limit (70W) can support. In this case, the GPU clocks will throttle, and none of the above commands will allow you to override this behavior. By design, you cannot force the GPU to operate at elevated clocks when the GPU is trying to protect itself, or protect the system it is running in.
Also, the fact that a T4 is not actively cooled really should not matter. The only approved/supported usage setting for a T4 is in a server that is designed to handle the T4. (A similar statement is true for any NVIDIA Datacenter GPU). Such servers monitor the T4 GPU temperature and provide server-delivered forced flow-through cooling to the GPU. This is by design. The server is responsible for keeping the GPU in a proper temperature operating range. If the server is not doing that, you should address that with your server vendor. If you are operating the T4 GPU in a non-approved setting (such as a non-qualified server, or a desktop/workstation) then I would generally expect the experience with that device to be dismal.
MORE RECENTLY: NVIDIA has published this blog which covers many of the same topics. If there are discrepancies between what I have stated above, and the blog, the blog should be considered the best source.

How to count the number of guest instructions QEMU executed from the beginning to the end of a run?

I want to benchmark guest instructions per second of QEMU to compare it with other simulators.
How to obtain the guest instruction count? I'm interested both in user and full system mode.
The only solution I have now would be to log all instructions with either the simple trace exec_tb or -d in_asm: How to use QEMU's simple trace backend? and then count the instructions from there. But this would likely reduce simulation performance considerably due to the output operations, so I would likely have to run the test program twice, once with and once without the trace, and hope that both executions are similar (they should be, especially for single-threaded user-mode simulation).
I saw the -icount option, which sounds promising from the name, but when I passed it to QEMU 4.0.0, I didn't see anything happen. Should it print an instruction count somewhere? The following patch appears unmerged and suggests not: https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg01275.html
Basic Profiling
To follow up on Peter's answer, I have recently run into a situation where I wanted to get the instruction count of a program run under QEMU (I'm using v4.2.0, the first where plugins became available).
One of the example plugins, insn.c, does exactly what you want, and returns the count of executed instructions on plugin exit.
(I assume you already know how to run QEMU, so I'll strip this down to the important flags)
qemu-system-arm ... -plugin qemu-install-dir/build/tests/plugin/libinsn.so,arg=inline -d plugin
The first part loads the plugin and passes a single argument, "inline", to it. The next part (-d plugin) enables printing of plugin output. You can redirect the plugin output to a different file by adding -D filename to the command-line invocation.
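For reference, here is a condensed sketch of what the bundled insn.c plugin does (QEMU >= 4.2). The plugin API has changed between releases, so check qemu-plugin.h for your version and treat this as a sketch rather than drop-in code; build it as a shared object against qemu-plugin.h from the QEMU source tree.

#include <inttypes.h>
#include <stdio.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

static uint64_t insn_count;

/* Called every time QEMU translates a block: register an inline counter
 * increment for each guest instruction in the block. */
static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);
    for (size_t i = 0; i < n; i++) {
        struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
        qemu_plugin_register_vcpu_insn_exec_inline(
            insn, QEMU_PLUGIN_INLINE_ADD_U64, &insn_count, 1);
    }
}

/* Print the total when the guest exits. */
static void plugin_exit(qemu_plugin_id_t id, void *p)
{
    char buf[64];
    snprintf(buf, sizeof(buf), "insns: %" PRIu64 "\n", insn_count);
    qemu_plugin_outs(buf);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}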
More Advanced Profiling
When I was looking for possible ways to profile a program run under QEMU, this was one of the only promising results of my search. In the spirit of creating a good record for others searching in the future, here are some links to code that I have written to do just that.
Profiling Plugin code, docs.
Disclaimer: I wrote the above code.
Current released versions of QEMU don't provide any means for doing this. The upcoming "TCG plugin" support, which should go out in the 4.2 release at the end of the year, would allow you to write a simple "count the instructions executed" plugin, but this (as with the -d tracing) will add an overhead.
The -icount option is certainly confusing, but what it does is make the emulated CPU (try to) run at a specific number of executed instructions per second, as opposed to the default of "as fast as possible". This has higher overhead (and it will stop QEMU using multiple host threads for SMP guests), but is more deterministic.
Philosophically speaking, "instructions per second" is a rather misleading metric for emulators, because the time taken to execute an instruction can vary vastly compared to hardware. Loads and stores are rather slower than on real hardware. Floating point instructions are incredibly slow (perhaps a factor of 10 or worse relative to an integer arithmetic instruction, where real hardware could execute both in one cycle). JIT emulators like QEMU have a start-stop performance profile where execution stops entirely while we translate a block of code, whereas a real CPU or an interpreting emulator will not have these pauses. How much effect the JIT time has will depend on whether your code reruns previously translated hot code frequently or spends most of its time running "new" code, and whether it does things that result in the JIT having to discard the old code (e.g. self-modifying code, or frequent between-process context switches). If you had an "IPS meter" on your emulator you'd see the value it reported fluctuate wildly as the guest code executed and did different things. You're probably better off just picking a benchmark which you think is representative of your actual use case, running it on various emulators, and comparing the wall-clock time it takes to complete.

In-memory function calls

What are in-memory function calls? Could someone please point me to some resource discussing this technique and its advantages? I need to learn more about them and at the moment do not know where to go. Google does not seem to help, as it takes me to the domain of cognition and the nervous system, etc.
Assuming your explanatory comment is correct (I'd have to see the original source of your question to know for sure), it's probably a matter of either (a) function binding times or (b) demand paging.
Function Binding
When a program starts, the linker/loader finds all function references in the executable file that aren't resolvable within the file. It searches all the linked libraries to find the missing functions, and then iterates. At least the Linux ld.so(8) linker/loader supports two modes of operation: LD_BIND_NOW forces all symbol references to be resolved at program start up. This is excellent for finding errors and it means there's no penalty for the first use of a function vs repeated use of a function. It can drastically increase application load time. Without LD_BIND_NOW, functions are resolved as they are needed. This is great for small programs that link against huge libraries, as it'll only resolve the few functions needed, but for larger programs, this might require re-loading libraries from disk over and over, during the lifetime of the program, and that can drastically influence response time as the application is running.
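The same lazy-vs-eager choice is exposed explicitly by dlopen(3), so here is a small sketch of it (the library and symbol names are just examples); from the shell, the corresponding knob for ordinary dynamic linking is running the program with LD_BIND_NOW=1.

/* Build with: cc bind.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* RTLD_LAZY: undefined function symbols are resolved the first time
     * they are called (like running without LD_BIND_NOW).
     * RTLD_NOW: everything is resolved here, before dlopen() returns
     * (like LD_BIND_NOW=1), so failures show up immediately. */
    void *handle = dlopen("libm.so.6", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (!cosine) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return 1;
    }
    printf("cos(0) = %f\n", cosine(0.0));

    dlclose(handle);
    return 0;
}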
Demand Paging
Modern operating system kernels juggle more virtual memory than physical memory. Each application thinks it has access to an entire machine of 4 gigabytes of memory (for 32-bit applications) or much, much more (for 64-bit applications), regardless of the actual amount of physical memory installed in the machine. Each page of memory needs a backing store, drive space that will be used to store that page if the page must be shoved out of physical memory under memory pressure. If it is purely data, then it gets stored in a swap partition or swap file. If it is executable code, then it is simply dropped, because it can be reloaded from the file in the future if it needs to be. Note that this doesn't happen on a function-by-function basis -- instead, it happens on pages, which are a hardware-dependent feature. Think 4096 bytes on most 32-bit platforms, perhaps more or less on other architectures, and with special frameworks, upwards of 2 megabytes or 4 megabytes. If there is a reference to a missing page, the memory management unit will signal a page fault, and the kernel will load the missing page from disk and restart the process.
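You can watch demand paging happen with mmap(2) plus mincore(2) on Linux; the sketch below uses an anonymous mapping for a clean demonstration (code pages of an executable are paged in from the file by the same mechanism), and the page count is just an example.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 64

static size_t count_resident(void *addr, size_t len)
{
    unsigned char vec[NPAGES];          /* one residency byte per page */
    size_t resident = 0;
    if (mincore(addr, len, vec) == 0)
        for (size_t i = 0; i < NPAGES; i++)
            resident += vec[i] & 1;
    return resident;
}

int main(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = NPAGES * pagesz;

    unsigned char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 1;

    printf("resident after mmap:        %zu of %d pages\n",
           count_resident(mem, len), NPAGES);

    /* Touch half the pages: each first write triggers a page fault that
     * the kernel services by allocating a physical page. */
    for (size_t i = 0; i < NPAGES / 2; i++)
        mem[i * pagesz] = 1;

    printf("resident after touching 32: %zu of %d pages\n",
           count_resident(mem, len), NPAGES);

    munmap(mem, len);
    return 0;
}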