I have a Windows CE 5.0 device driver that protects some resources with critical sections. Threads from the client processes migrate into device.exe and enter and leave these critical sections. The Enter/Leave calls can be nested (hierarchical).
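For concreteness, the pattern inside the driver looks roughly like the following sketch (the function names and the guarded resource are made up; only the nested Enter/Leave structure matters):

    #include <windows.h>

    static CRITICAL_SECTION g_cs;    // protects a shared driver resource
    static int g_sharedState;

    void DrvInit()
    {
        InitializeCriticalSection(&g_cs);
    }

    void DrvInnerUpdate()
    {
        EnterCriticalSection(&g_cs); // re-entering from the same thread is allowed
        ++g_sharedState;
        LeaveCriticalSection(&g_cs);
    }

    void DrvOuterOperation()
    {
        EnterCriticalSection(&g_cs);
        DrvInnerUpdate();            // nested ("hierarchical") Enter/Leave
        LeaveCriticalSection(&g_cs);
    }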
When one of my driver's client processes has more than one thread, and
thread A has entered a critical section in the driver, and
another thread B does something (like a division by zero) that terminates the process with an exception,
what happens to the critical section? The driver DLL is loaded in the process device.exe and won't be unloaded.
But what happens to the critical sections? Can other threads now enter them? What happens to any resources that were allocated by thread A?
Any pointer to documentation is welcome.
From this, it appears that two kernels from different contexts cannot execute concurrently. In this regard, I am confused when reading CUPTI activity traces from two applications. The traces show kernel_start_timestamp, kernel_end_timestamp and duration (which is kernel_end_timestamp - kernel_start_timestamp).
Application 1:
.......
8024328958006530 8024329019421612 61415082
.......
Application 2:
.......
8024328940410543 8024329048839742 108429199
To make the long timestamp and duration more readable:
Application 1 : kernel X of 61.415 ms ran from xxxxx28.958 s to xxxxx29.019 s
Application 2 : kernel Y of 108.429 ms ran from xxxxx28.940 s to xxxxx29.0488 s
So, the execution of kernel X completely overlaps with that of kernel Y.
I am using the /path_to_cuda_install/extras/CUPTI/sample/activity_trace_async for tracing the applications. I modified CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to 1024 and CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT to 1. I have only enabled tracing for CUPTI_ACTIVITY_KIND_MEMCPY, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL and CUPTI_ACTIVITY_KIND_OVERHEAD. My applications are calling cuptiActivityFlushAll(0) once in each of their respective logical timesteps.
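Roughly, the CUPTI configuration amounts to the following (a reduced sketch of what the modified sample does; error checking and the buffer request/completion callbacks are omitted):

    #include <cupti.h>

    void setupTracing()
    {
        // Shrink the device buffer size and pool limit as described above.
        size_t value = 1024;
        size_t valueSize = sizeof(size_t);
        cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE,
                                  &valueSize, &value);
        value = 1;
        cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT,
                                  &valueSize, &value);

        // Only these three activity kinds are enabled.
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_OVERHEAD);
    }

    void endOfTimestep()
    {
        // Called once per logical timestep of the application.
        cuptiActivityFlushAll(0);
    }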
Are these erroneous CUPTI values that I am seeing due to improper usage or is it something else?
Clarification : MPS not enabled, running on single GPU
UPDATE: bug filed, this seems to be a known problem for CUDA 6.5
Waiting for a chance to test this with CUDA 7 (have a GPU shared between multiple users and need a window of inactivity for temporary switch to CUDA 7)
I don't know how you have set up the CUPTI activity traces. But two kernels can share a time span on a single GPU even without the MPS server, though only one of them is actually running on the GPU at any instant.
If the CUDA MPS server is not in use, then kernels from different contexts cannot overlap. I am assuming that you're not using the MPS server, so a time-sliced scheduler decides which context gets to access the GPU at any moment. Without MPS, a context can only access the GPU in the time slots that the time-sliced scheduler assigns to it. Thus, only kernels from a single context are running on the GPU at any given time (without the MPS server).
Note that it is possible for multiple kernels to share a time span on a GPU, but within that span only kernels from a single context can access the GPU resources at any instant (I am also assuming that you're using a single GPU).
For more information you can also check the MPS Service documentation.
From Computer Organization and Design, by Patterson et al.:
Why is an "I/O device request" an external interrupt?
Does "I/O device request" mean that a user program requests I/O device services via system calls? If so, isn't a system call an internal exception?
Thanks.
It's referring to peripheral devices signaling that they require attention, e.g. disk controller hardware that is now ready to satisfy a read request it received earlier (or has finished DMAing in the data for that read request).
The path into the operating system is an array of pointers. This array may have different names depending upon the system; I will call it the "dispatch table." The dispatch table handles everything that needs the attention of the operating system: interrupts, faults, and traps. The last two are collectively "exceptions".
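As a rough illustration only (no real OS lays it out exactly like this), the dispatch table is conceptually just an array of handler pointers indexed by the interrupt/exception vector number:

    // Conceptual sketch; vector numbers and handler names are invented.
    using Handler = void (*)();

    void divideErrorHandler()  {}   // exception (trap): caused by an executed instruction
    void pageFaultHandler()    {}   // exception (fault): page not yet mapped
    void syscallTrapHandler()  {}   // trap: "invoke the OS from user program"
    void diskIrqHandler()      {}   // interrupt: a device signals for attention

    Handler dispatchTable[256];

    void initDispatchTable()
    {
        dispatchTable[0x00] = divideErrorHandler;
        dispatchTable[0x0E] = pageFaultHandler;
        dispatchTable[0x80] = syscallTrapHandler;
        dispatchTable[0x2E] = diskIrqHandler;
    }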
An exception is caused by executing an instruction. Exceptions are synchronous.
An interrupt is caused by something occurring outside the executing process/thread.
A user invokes the operating system synchronously by executing an instruction that causes a trap (on Intel chips such a trap is misleadingly called a "software interrupt"). Such an event is a synchronous, predictable result of the instruction stream.
Such a trap would be used to queue an I/O request to the device. "Invoke the operating system from user program" in your table.
The device would cause an interrupt when the request is completed. That is what is meant by an "I/O Device Request" in your table.
The confusion is that interrupts, faults and traps are all handled the same way by the operating system through the dispatch table. And, as I said, in Intel land they call both traps and interrupts "Interrupts".
Because the interrupt isn't generated by the processor or the program. It is a physical wire connected to the interrupt controller whose state changes, driven by the controller for the device, external to the processor. The interrupt handler is usually located in a driver that knows how to handle the device controller's request for service.
"Invoke the operating system" is a software interrupt, usually switches the processor into protected mode to handle the request.
"Arithmetic overflow" is typically a trap that's generated by the floating point unit on the processor.
"Using an undefined instruction" is another trap, generated by the processor itself when it can't execute code anymore because the instruction is invalid.
Processors usually have more traps like that. Like division by zero. Or executing a privileged instruction. Or a page fault when virtual memory isn't mapped to physical memory yet. Or a protection fault when the program reads an unmapped virtual memory address.
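To make the synchronous nature of these traps concrete, here is a trivial, deliberately faulty fragment; each marked line traps because of the instruction being executed, not because of any external event:

    int main()
    {
        volatile int zero = 0;
        volatile int q = 1 / zero;       // arithmetic trap: division by zero
        (void)q;
        // *(volatile int *)0 = 42;      // fault: access to an unmapped address
        return 0;
    }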
I am giving very limited information, I know, but it is probably enough.
System specs: MIPS 64-bit processor with 8 cores running 4 virtual cpus each.
OS: Some proprietary Linux based on the 2.6.32.9 kernel
Process: a simple user-land process running 7 POSIX threads. This specific application is running on core 0, which no process has claimed via CPU affinity.
The crash is almost impossible to reproduce. There is no specific scenario. We know that if we perform some minor activity with the application, it might crash once a day.
The specific thread that's crashing wakes up every 5 milliseconds, reads information from one shared memory area and updates another. That's it.
There is not too much activity. The process is not working too hard.
Now: when I open the core and load a symbol-less image of the application, gdb points to instruction 100661e0. Instruction 100661e0 looks like this (viewed in an objdump disassembly of the non-stripped image):
void class::foo(uint8_t factor)
{
100661d8: ffbf0018 sd ra,24(sp)
100661dc: 0080802d move s0,a0
bar(factor, shared_memory_a->profiles[PROFILE_1]);
100661e0: 0c019852 jal 10066148 <_ZN28class35barEhR30profile_t>
100661e4: 64c673c8 daddiu a2,a2,29640
bar(factor, shared_memory_a->profiles[PROFILE_2]);
100661e8: de060010 ld a2,16(s0)
The line that shows as the exception line is
100661e0: 0c019852 jal 10066148 <_ZN28class35barEhR30profile_t>
Note that 10066148 is a valid instruction.
The BadVAddr register contains the following address, which is aligned but does not look valid as far as the instruction address space is concerned: c0000003ddc5dd90
The cause register contains this value: 0000000000800014
I don't understand why the BadVAddr register shows the value it does, when the instruction specifically targets a valid address. I am a little concerned about branch delay slot issues, but I shouldn't have to worry about those when I am running a userland application written in simple C++.
I'd appreciate any thoughts.
My program runs 2 threads - Thread A (for input) and B (for processing). I also have a pair of pointers to 2 buffers, so that when Thread A has finished copying data into Buffer 1, Thread B starts processing Buffer 1 and Thread A starts copying data into Buffer 2. Then when Buffer 2 is full, Thread A copies data into Buffer 1 and Thread B processes Buffer 2, and so on.
My problem comes when I try to cudaMemcpy Buffer[] into d_Buffer (which was previously cudaMalloc'd by the main thread, i.e. before thread creation; Buffer[] was also malloc'd by the main thread). I get an "invalid argument" error, but have no idea which argument is invalid.
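Stripped down, the failing call looks roughly like this (the buffer size and everything except the Buffer/d_Buffer names are placeholders):

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    const size_t N = 1 << 20;   // placeholder buffer size
    float *Buffer[2];           // host buffers, malloc'd by the main thread
    float *d_Buffer = NULL;     // device buffer, cudaMalloc'd by the main thread

    void mainThreadSetup()
    {
        Buffer[0] = (float *)malloc(N * sizeof(float));
        Buffer[1] = (float *)malloc(N * sizeof(float));
        cudaMalloc((void **)&d_Buffer, N * sizeof(float));
    }

    void processingThreadStep(int which)
    {
        // This is the call that returns "invalid argument" when it runs in
        // thread B, even though the same line works in the single-threaded build.
        cudaError_t err = cudaMemcpy(d_Buffer, Buffer[which], N * sizeof(float),
                                     cudaMemcpyHostToDevice);
        if (err != cudaSuccess)
            printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    }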
I've reduced my program to a single-threaded program, but still using 2 buffers. That is, the copying and processing takes place one after another, instead of simultaneously. The cudaMemcpy line is exactly the same as the double-threaded one. The single-threaded program works fine.
I'm not sure where the error lies.
Thank you.
Regards,
Rayne
If you are doing this with CUDA 3.2 or earlier, the reason is that GPU contexts are tied to a specific thread. If a multi-threaded program allocates memory on the same GPU from different host threads, the allocations wind up establishing different contexts, and pointers from one context are not portable to another context. Each context has its own "virtualised" memory space to work with.
The solution is to either use the context migration API to transfer a single context from thread to thread as they do work, or try the new public CUDA 4.0rc2 release, which should support what you are trying to do without the use of context migration. The downside is that 4.0rc2 is a testing release, and it requires a particular beta release driver. That driver won't work with all hardware (laptops, for example).
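If you stay on 3.2, the context migration approach amounts to something like this driver API sketch (error checking omitted; which thread pops and which pushes depends on where your hand-over points are):

    #include <cuda.h>

    CUcontext ctx;   // the single context both threads share

    void handOffFromThreadA()
    {
        // Thread A detaches the context from itself when it is done with the GPU.
        cuCtxPopCurrent(&ctx);
    }

    void takeOverInThreadB()
    {
        // Thread B attaches the same context; allocations made while it was
        // current in thread A (e.g. d_Buffer) are now usable from thread B.
        cuCtxPushCurrent(ctx);
    }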
I have a VxWorks application running on ARM uC.
First, let me summarize the application:
The application consists of a 3rd-party stack and a gateway application.
We have implemented an operating system abstraction layer to support OS independence.
The underlying stack has its own memory management & control facility, which holds memory blocks in a doubly linked list.
For instance, we don't directly call malloc/new or free/delete. Instead we call the OSA layer's routines (XXAlloc, XXFree, XXReAlloc); the OSA layer gets the memory from the OS, puts it in a list, and then returns this memory to the application.
When freeing the memory we again use XXFree.
In fact this block is a struct which has:
- magic numbers indicating the beginning and end of the memory block
- the size that the user requested
- the size actually allocated (due to alignment)
- previous and next pointers
- a pointer to the piece of memory given back to the application
- the link register value that shows where in the application xxAlloc was called
With this block structure the stack can check whether a block is corrupted or not.
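A rough sketch of what such a block header might look like (field names, types, and ordering here are illustrative, not the stack's actual layout):

    #include <cstdint>
    #include <cstddef>

    struct MemBlock
    {
        uint32_t   magicHead;       // marks the beginning of the block
        size_t     requestedSize;   // size the caller asked for
        size_t     actualSize;      // real size after alignment padding
        MemBlock  *prev;            // doubly linked list of live blocks
        MemBlock  *next;
        void      *userPtr;         // pointer handed back to the application
        uintptr_t  callerLR;        // link register: where xxAlloc was called
        uint32_t   magicTail;       // marks the end of the block
    };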
Also we have a pthread library ported from Linux that we use to
- create/terminate threads (currently there are 22 threads)
- create synchronization objects (events, mutexes, ...)
There is a main task started via taskSpawn, and later this task creates the other threads.
This was a description of the application and its VxWorks interface.
The problem is:
One of the tasks suddenly gets destroyed by VxWorks, giving no information about what's wrong.
I also have a JTAG debugger, and it hits the VxWorks taskDestroy() routine, but the call stack doesn't give any information, and neither do the PC or r14.
I'm suspicious of a specific routine in the code where a huge xxAlloc is done, but the problem occurs very sporadically, giving no clue that would let me map it to the source code.
I think the OS detects an exception and handles it silently.
Any help would be great.
Regards
It is resolved.
I did an isolated test: I allocated 20 MB with malloc, memset it with 0x55, and stopped that thread of my application.
Then I wrote another thread which checks whether anything other than 0x55 has been written into my 20 MB.
And guess what: some other thread, belonging to other components on the CPU (developed by someone else), writes into my allocated space.
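For anyone hunting a similar corruption, the isolated test was essentially this (the 20 MB size and 0x55 pattern are from above; the thread setup and names are illustrative):

    #include <pthread.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>
    #include <cstdlib>

    const size_t GUARD_SIZE = 20u * 1024u * 1024u;   // 20 MB
    const unsigned char PATTERN = 0x55;

    static unsigned char *guard;

    static void *checker(void *)
    {
        for (;;)
        {
            for (size_t i = 0; i < GUARD_SIZE; ++i)
                if (guard[i] != PATTERN)
                    printf("corruption at offset %zu: 0x%02x\n",
                           i, (unsigned)guard[i]);
            sleep(1);
        }
        return NULL;
    }

    int main()
    {
        guard = (unsigned char *)malloc(GUARD_SIZE);
        memset(guard, PATTERN, GUARD_SIZE);

        pthread_t t;
        pthread_create(&t, NULL, checker, NULL);
        pthread_join(t, NULL);   // the checker loops forever; join just blocks
        return 0;
    }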
Thanks for your help.
If your task exits, taskDestroy() is called. If you are suspicious of huge xxAlloc, verify that the allocation code is not calling exit() when memory is exhausted. I've been bitten by this behavior in a third party OSAL before.
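The pattern to look for in the OSAL is sketched below (hypothetical names): an allocator that calls exit() on failure quietly ends the calling task under VxWorks instead of returning an error.

    #include <cstdlib>
    #include <cstddef>

    // Anti-pattern: if the allocation fails, exit() ends the calling task on
    // VxWorks, which then shows up as a taskDestroy() with no obvious cause.
    void *xxAlloc(std::size_t n)
    {
        void *p = std::malloc(n);
        if (p == NULL)
            std::exit(1);   // prefer returning NULL and letting the caller decide
        return p;
    }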
Sounds like you are debugging after integration; this can be a hell of a job.
I suggest breaking the problem into smaller pieces.
Process
1) You can get more insight by instrumenting the code and/or using VxWorks instrumentation (depending on which version). This allows you to get more visibility into what happens. Be sure to log everything to a file, so you can move back in time from the point where the task ends. Instrumentation is a worthwhile investment, as it will come in handy on other occasions too. Interesting hooks in VxWorks: taskHookLib (see the sketch after this list).
2) Memory allocation/deallocation is very fundamental functionality. It would be my first candidate for thorough (unit) testing in a well-defined multi-threaded environment. If you have done this and no errors are found, I'd start by looking at why the task has ended.
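A sketch of the task-hook idea from point 1 (the exact hook prototypes vary between VxWorks versions, so treat this as a starting point rather than verified code):

    #include <vxWorks.h>
    #include <taskLib.h>
    #include <taskHookLib.h>
    #include <logLib.h>

    // Called in the context of the task being deleted, so the log line tells
    // you which task ends and roughly when.
    void myDeleteHook(WIND_TCB *pTcb)
    {
        logMsg((char *)"task with TCB %#x is being deleted\n",
               (int)pTcb, 0, 0, 0, 0, 0);
    }

    void installHooks()
    {
        taskDeleteHookAdd((FUNCPTR)myDeleteHook);
    }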
Other possible causes
A task will also end when its work is done, so it may be a return caused by a not-so-endless loop. Especially if it is always the same task, this would be my guess.
And some versions of VxWorks have MMU support which must be considered.