After building ThreadX with the TX_THREAD_ENABLE_PERFORMANCE_INFO option and linking it into the project, the processor hangs.
It works fine without this option.
What are the requirements (stack, RAM, flash, etc.) for this option?
Both ThreadX and the application must be compiled with TX_THREAD_ENABLE_PERFORMANCE_INFO defined.
The problem arose because I had compiled only ThreadX, not the application, with TX_THREAD_ENABLE_PERFORMANCE_INFO defined.
After defining TX_THREAD_ENABLE_PERFORMANCE_INFO for the application as well, the problem was solved.
The size of the structures holding data for threads, semaphores and other objects will grow to accommodate the variables needed to keep track of the monitored operations. Stack sizes may grow too. The exact numbers will depend on your architecture, compiler, compiler options and other ThreadX user options. Try increasing the stack sizes that you are using, and make sure that the thread/semaphore/etc. structures have enough space. You can check the stack size used by each thread in its TX_THREAD structure, either directly in the debugger or using the IDE's OS support for ThreadX, if your tool provides it.
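A minimal sketch of why a mismatched define can hang the system: the library and the application end up disagreeing about the size of the same control block. The struct and field names below are hypothetical stand-ins, not the real TX_THREAD layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical control block as the application sees it
   (performance option NOT defined). Not the real TX_THREAD layout. */
struct demo_thread_app {
    void         *stack_start;
    unsigned long stack_size;
    unsigned long run_count;
};

/* The same control block as the library sees it
   (performance option defined): extra counters are appended. */
struct demo_thread_lib {
    void         *stack_start;
    unsigned long stack_size;
    unsigned long run_count;
    unsigned long perf_resume_count;   /* exists only with the option */
    unsigned long perf_suspend_count;  /* exists only with the option */
};

/* Bytes the library would write past the end of an object that the
   application allocated using its own, smaller definition. */
size_t demo_overrun_bytes(void)
{
    assert(sizeof(struct demo_thread_lib) >= sizeof(struct demo_thread_app));
    return sizeof(struct demo_thread_lib) - sizeof(struct demo_thread_app);
}
```

Any non-zero result means the library would overwrite whatever the application placed right after the object, which is consistent with the hang described above.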
I'm looking for a way to effectively virtualize a single process (and, presumably, any children it creates). Although a Container model sounds appropriate, products like Docker don't quite fit the bill.
The Intel VMX model allows one to create a hypervisor, then launch a VM, which will exit back to the hypervisor under certain (programmable) conditions, such as privileged instruction execution, CR3/CR8 manipulation, exceptions, direct I/O, etc. The features of the VMX model fit very well with my needs, except that I can't require a VM with a separate instance of the entire OS to accomplish the task - I just want my hypervisor to control one child application (think Photoshop/Excel/Firefox; one process and its progeny, if any) that's running under the host OS, and catch VM exits under the specified conditions (for debugging and/or emulation purposes). Outside of the exit conditions, the child process should run unencumbered, and have access to all OS resources to which it would be entitled without the VM, including filesystem, graphical output, keyboard/mouse input, IPC/messaging, etc. For my purposes, I am not interested in isolation or access restriction, which is the typical motivation for using a VM - to the contrary, I want the child process to be fully enmeshed in the host OS environment. While operating entirely in user-space is preferable, I can utilize Ring 0 to facilitate this. (Note that the question is Intel-specific and OS-agnostic, although it's likely to be implemented in a *nix environment first.)
I'm wondering what would happen if I had my hypervisor set up a VMCS that simply mirrored the host's actual configuration, including page tables, IDT, etc., then VMLAUNCH 0(%rip) (in effect, a pseudo-fork?) and execute the child process from there. (That seems far too simplistic to actually work, but the notion does have some appeal). Assuming that's a Bad Idea™, how might I approach this problem?
Is passing a pointer to cudaHostRegister that's not page-aligned allowed/portable? I'm asking because the simpleStream example does manual page alignment, but I can't find this requirement in the documentation. Maybe it's a portability problem (similar to mlock() supporting non-aligned addresses on Linux, while POSIX does not in general)?
I changed the bandwidth test to use non-aligned but registered memory, and it performs the same as memory returned by cudaHostAlloc. Since I use these pinned buffers for overlapping copies and computation, I'm also interested in whether non-alignment prevents that (so far I could not detect a performance loss).
All my tests were on x86-64 Linux.
Maybe it's a portability problem (similar to mlock() supporting non-aligned on linux, but POSIX does not in general)?
Both Linux's mlock and Windows' VirtualLock will lock all pages containing any byte of the address range you want to lock; manual alignment is not needed. But as you noted, POSIX allows an implementation to require the argument of mlock to be page-aligned. This is notably the case with OS X's mlock, which will round a page-unaligned address up to the next page boundary and therefore not lock the entire address range.
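For platforms that do require aligned arguments, the manual alignment amounts to widening the range to whole pages. A minimal sketch, assuming a power-of-two page size; the helper name page_span is hypothetical, and overflow of addr + len is not handled:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand [addr, addr + len) to whole pages.
   page_size must be a power of two (for example the value returned
   by sysconf(_SC_PAGESIZE) on POSIX systems). */
void page_span(uintptr_t addr, size_t len, size_t page_size,
               uintptr_t *start, size_t *span_len)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uintptr_t mask  = (uintptr_t)page_size - 1;
    uintptr_t first = addr & ~mask;                 /* round start down */
    uintptr_t last  = (addr + len + mask) & ~mask;  /* round end up */

    *start    = first;
    *span_len = (size_t)(last - first);
}
```

The resulting start address and length cover every page touched by the original range, so they could be passed to a call such as mlock() that insists on page alignment.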
The documentation of cudaHostRegister makes no mention of any alignment constraint on its arguments. As such, a consumer of this API would be right to expect that any alignment concern on the underlying platform is the responsibility of cudaHostRegister, not the user. But without seeing the source of cudaHostRegister, it's impossible to tell whether this is actually the case. As the sample deliberately takes care of alignment manually, it is possible that cudaHostRegister doesn't have such transparent alignment-fixing functionality.
Therefore, yes, it is likely the sample was written to ensure its portability across OSes supported by CUDA (Windows, Linux, Mac OS X).
I just found the following lines in the old 4.0 NVIDIA Library... Maybe it can be helpful for future questions:
The CUDA context must have been created with the cudaMapHost flag in order for the cudaHostRegisterMapped flag to have any effect.
The cudaHostRegisterMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to cudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via the cudaHostRegisterPortable flag.
and finally
The pointer ptr and size size must be aligned to the host page size (4 KB).
So it is about the host page size.
I have a question about Griffon.
Is there a way to decrease the memory consumption of Griffon applications?
Actually, the sample Griffon application process (just a single window with a label) takes ~80MB on Windows. Is there a way to change something to visibly decrease this basic memory usage?
Griffon is a great solution etc., but my customer complains that a simple application takes such an amount of memory (more than e.g. Word, Outlook, or most complicated Java applications - comparable with whole).
A barebones Griffon app (created by calling create-app and nothing more) reports 49M of memory usage. In terms of file size it's a bit above 7M. A Java-based Griffon app (griffon create-app sample --file-type=java), by contrast, reports 42M of memory usage, with the same file size.
This is of course using default settings provided by the run-app command. Further memory configuration settings may be applied to limit and streamline resource consumption.
What are in-memory function calls? Could someone please point me to some resource discussing this technique and its advantages? I need to learn more about them and at the moment do not know where to go. Google does not seem to help, as it takes me to the domain of cognition, the nervous system, etc.
Assuming your explanatory comment is correct (I'd have to see the original source of your question to know for sure), it's probably a matter of either (a) function binding times or (b) demand paging.
Function Binding
When a program starts, the linker/loader finds all function references in the executable file that aren't resolvable within the file. It searches all the linked libraries to find the missing functions, and then iterates. At least the Linux ld.so(8) linker/loader supports two modes of operation:
LD_BIND_NOW forces all symbol references to be resolved at program start-up. This is excellent for finding errors, and it means there's no penalty for the first use of a function vs. repeated use of a function. It can drastically increase application load time.
Without LD_BIND_NOW, functions are resolved as they are needed. This is great for small programs that link against huge libraries, as it'll only resolve the few functions needed, but for larger programs, this might require re-loading libraries from disk over and over during the lifetime of the program, and that can drastically influence response time as the application is running.
Demand Paging
Modern operating system kernels juggle more virtual memory than physical memory. Each application thinks it has access to an entire machine with 4 gigabytes of memory (for 32-bit applications) or much, much more memory (for 64-bit applications), regardless of the actual amount of physical memory installed in the machine. Each page of memory needs a backing store: drive space that will be used to store that page if it must be shoved out of physical memory under memory pressure. If it is purely data, then it gets stored in a swap partition or swap file. If it is executable code, then it is simply dropped, because it can be reloaded from the file in the future if it needs to be. Note that this doesn't happen on a function-by-function basis; instead, it happens on pages, which are a hardware-dependent feature. Think 4096 bytes on most 32-bit platforms, perhaps more or less on other architectures, and with special frameworks, upwards of 2 megabytes or 4 megabytes. If there is a reference to a missing page, the memory management unit will signal a page fault, and the kernel will load the missing page from disk and restart the process.
I have a VxWorks application running on an ARM uC.
First, let me summarize the application:
The application consists of a 3rd-party stack and a gateway application.
We have implemented an operating system abstraction (OSA) layer to support OS independence.
The underlying stack has its own memory management and control facility, which holds memory blocks in a doubly linked list.
For instance, we don't call malloc/new or free/delete directly. Instead we call the OSA layer's routines (XXAlloc, XXFree, XXReAlloc); the layer gets the memory from the OS, puts it in a list, and then returns it to the application.
When freeing the memory we again use XXFree.
In fact, each block is a struct which has
-magic numbers indicating the beginning and end of the memory block
-the size that the user requested
-the size actually allocated, due to alignment
-previous and next pointers
-a pointer to the piece of memory given back to the application
-the link register value that shows where in the application xxAlloc was called
With this block structure, the stack can check whether a block is corrupted.
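A minimal sketch of such a guarded block, assuming a header placed directly in front of the user data and a tail magic placed directly behind it. All names (demo_block, demo_alloc, demo_block_ok, demo_free) and the magic values are hypothetical, not the 3rd-party stack's actual code; list linking and size-overflow checks are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAGIC_HEAD 0xDEADBEEFu
#define DEMO_MAGIC_TAIL 0xFEEDFACEu

struct demo_block {
    uint32_t           magic_head;  /* marks the beginning of the block */
    size_t             user_size;   /* size the caller asked for */
    size_t             real_size;   /* size actually taken from the OS */
    struct demo_block *prev, *next; /* doubly linked list (unused here) */
    void              *caller;      /* call site, e.g. the link register */
    /* user data follows, then a uint32_t tail magic */
};

void *demo_alloc(size_t n, void *caller)
{
    size_t real = sizeof(struct demo_block) + n + sizeof(uint32_t);
    struct demo_block *b = malloc(real);
    uint32_t tail = DEMO_MAGIC_TAIL;

    if (b == NULL)
        return NULL;
    b->magic_head = DEMO_MAGIC_HEAD;
    b->user_size  = n;
    b->real_size  = real;
    b->prev = b->next = NULL;
    b->caller = caller;
    /* The tail may be unaligned, so copy it byte-wise. */
    memcpy((unsigned char *)(b + 1) + n, &tail, sizeof tail);
    return b + 1;
}

/* Returns 1 if both magic numbers are intact, 0 if the block is corrupted. */
int demo_block_ok(const void *user)
{
    const struct demo_block *b;
    uint32_t tail;

    assert(user != NULL);
    b = (const struct demo_block *)user - 1;
    memcpy(&tail, (const unsigned char *)user + b->user_size, sizeof tail);
    return b->magic_head == DEMO_MAGIC_HEAD && tail == DEMO_MAGIC_TAIL;
}

void demo_free(void *user)
{
    if (user != NULL)
        free((struct demo_block *)user - 1);
}
```

A check like this only catches writes that land on the magic numbers; a stray write into the middle of the user data goes unnoticed.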
We also have a pthread library, ported from Linux, that we use for
-creating/terminating threads (currently there are 22 threads)
-synchronization objects (events, mutexes, ...)
There is a main task started by taskSpawn, and this task later creates the other threads.
That was a description of the application and its VxWorks interface.
The problem is:
One of the tasks suddenly gets destroyed by VxWorks, with no information about what went wrong.
I also have a JTAG debugger, and it hits the VxWorks taskDestroy() routine, but the call stack gives no information, and neither do the PC or r14.
I'm suspicious of a specific routine in the code where a huge xxAlloc is done, but the problem occurs very sporadically, giving no clue that I can map to the source code.
I think the OS detects an exception and handles it silently.
Any help would be great.
Regards
It is resolved.
I did an isolated test: I allocated 20MB with malloc, filled it with 0x55 using memset, and stopped my application's thread.
Then I wrote another thread which checks whether anything other than 0x55 has been written to my 20MB.
And guess what! Some other thread, belonging to other components on the CPU (developed by someone else), writes into my allocated space.
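The isolated test described above can be sketched roughly as follows. The helper names and the DEMO_FILL constant are hypothetical; in the real test, the buffer would be the 20MB malloc'ed region and the scan would run periodically in the checking thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_FILL 0x55

/* Fill the whole buffer with the known pattern. */
void demo_fill(unsigned char *buf, size_t len)
{
    memset(buf, DEMO_FILL, len);
}

/* Return the offset of the first byte that no longer holds the pattern,
   or len if the buffer is intact. volatile forces a fresh read of every
   byte, since another thread may be writing to the buffer. */
size_t demo_first_corrupt(const volatile unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != DEMO_FILL)
            return i;
    }
    return len;
}
```

The returned offset, added to the buffer's base address, gives an address on which a hardware watchpoint could be set to catch the offending writer.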
Thanks for your help.
If your task exits, taskDestroy() is called. If you are suspicious of huge xxAlloc, verify that the allocation code is not calling exit() when memory is exhausted. I've been bitten by this behavior in a third party OSAL before.
Sounds like you are debugging after integration; this can be a hell of a job.
I suggest breaking the problem into smaller pieces.
Process
1) You can get more insight by instrumenting the code and/or using VxWorks instrumentation (depending on the version). This gives you more visibility into what happens. Be sure to log everything to a file, so you can move back in time from the point where the task ends. Instrumentation is a worthwhile investment, as it will be handy on other occasions. Interesting hooks in VxWorks: taskHookLib
2) Memory allocation/deallocation is very fundamental functionality. It would be my first candidate for thorough (unit) testing in a well-defined multi-threaded environment. If you have done this and no errors are found, I'd next look into why the task has ended.
Other possible causes
A task will also end when its work is done, so it may be a return caused by a not-so-endless loop. Especially if it is always the same task, this would be my guess.
And some versions of VxWorks have MMU support which must be considered.