I am giving very limited information, I know, but it is probably enough.
System specs: MIPS 64-bit processor with 8 cores running 4 virtual cpus each.
OS: Some proprietary Linux based on the 2.6.32.9 kernel
Process: a simple user-land process running 7 posix threads. This specific application is running on core 0, which does not have any cpu affinity by any process.
The crash is almost impossible to reproduce. There is no specific scenario. We know that, if we perform some minor activity with the application it might crash once a day.
The specific thread that's crashing wakes up every 5 milliseconds, reads information from one shared memory area and updates another. That's it.
There is not too much activity. The process is not working too hard.
Now: When I open the core and load a symbol-less image of the application, gdb points to instruction 100661e0. Instruction 100661e0 looks like this (viewing with an objdump view of the non-stripped image):
void class::foo(uint8_t factor)
{
100661d8: ffbf0018 sd ra,24(sp)
100661dc: 0080802d move s0,a0
bar(factor, shared_memory_a->profiles[PROFILE_1]);
100661e0: 0c019852 jal 10066148 <_ZN28class35barEhR30profile_t>
100661e4: 64c673c8 daddiu a2,a2,29640
bar(factor, shared_memory_a->profiles[PROFILE_2]);
100661e8: de060010 ld a2,16(s0)
The line that shows as the exception line is
100661e0: 0c019852 jal 10066148 <_ZN28class35barEhR30profile_t>
Note that 10066148 is a valid instruction.
The Bad register contains the following address, which is aligned, but does not look valid as far as the instruction address space is concerned: c0000003ddc5dd90
The cause register contains this value: 0000000000800014
I don't understand why the Bad registers shows the value it does, where the instruction specifically states a valid instruction. I am a little concerned about branch delay slot issues, but I shouldn't be concerned about those when I am running a userland application using simple C++.
I'd appreciate any thoughts.
Related
I meet this problem when finishing the lab of my OS course. We are trying to implement a kernel with the function of system call (platform: QEMU/i386).
When testing the kernel, problem occurred that after kernel load user program to memory and change the CPU state from kernel mode to user mode using 'iret' instruction, CPU works in a strange way as following.
%EIP register increased by 2 each time no matter how long the current instrution is.
no instruction seems to be execute, for no other registers change meantime.
Your guest has probably ended up executing a block of zeroed out memory. In i386, zeroed memory disassembles to a succession of "add BYTE PTR [rax],al" instructions, each of which is two bytes long (0x00 0x00), and if rax happens to point to memory which reads as zeroes, this will effectively be a 2-byte-insn no-op, which corresponds to what you are seeing. This might happen because you set up the iret incorrectly and it isn't returning to the address you expected, or because you've got the MMU setup wrong and the userspace program isn't in the memory where you expect it to be, for instance.
You could confirm this theory using QEMU's debug options (eg -d in_asm,cpu,exec,int,unimp,guest_errors -D qemu.log will log a lot of execution information to a file), which should (among a lot of other data) show you what instructions it is actually executing.
I’m trying to use QEMU to emulate a piece of firmware, but I’m having trouble getting the UART device to properly update the Line Status Register and display the input character.
Details:
Target device: Qualcomm QCA9533 (Documentation here if you're curious)
Target firmware: VxWorks 6.6 with U-Boot bootload
CPU: MIPS 24Kc
Board: mipssim (modified)
Memory: 512MB
Command used: qemu-system-mips -S -s -cpu 24Kc -M mipssim –nographic -device loader,addr=0xBF000000,cpu-num=0 -serial /dev/ttyS0 -bios target_image.bin
I have to apologize here, but I am unable to share my source. However, as I am attempting to retool the mipssim board, I have only made minor changes to the code, which are as follows:
Rebased bios memory region to 0x1F000000
Changed load_image_targphys() target address to 0x1F000000
Changed $pc initial value to 0xBF000000 (TLB remap of 0x1F000000)
Replaced the mipssim serial_init() ¬call with serial_mm_init(isa, 0x20000, env->irq[0], 115200, serial_hd(0), DEVICE_NATIVE_ENDIAN).
While it seems like serial_init() is probably the currently accepted standard, I wasn’t having any luck with remapping it. I noticed the malta board had no issues outputting on a MIPS test kernel I gave it, so I tried to mimic what was done there. However, I still cannot understand how QEMU works and I am unable to find many good resources that explain it. My slog through the source and included docs is ongoing, but in the meantime I was hoping someone might have some insight into what I’m doing wrong.
The binary loads and executes correctly from address 0xBF000000, but hangs when it hits the first UART polling loop. A look at mtree in the QEMU monitor shows that the I/O device is mapped correctly to address range 0x18020000-0x1802003F, and when the firmware writes to the Tx buffer, gdb shows the character successfully is written to memory. There’s just no further action from the serial device to pull that character and display it, so the firmware endlessly polls on the LSR waiting for an update.
Is there something I’m missing when it comes to serial/hardware interaction in QEMU? I would have assumed that remapping all of the existing functional components of the mipssim board would be enough to at least get serial communication working, especially since the target uses the same 16550 UART that mipssim does. Please let me know if you have any insights. It would be helpful if I could find a way to debug QEMU itself with symbols, but at the same time I’m not totally sure what I’d be looking for. Even advice on how to scope down the issue would be useful.
Thank you!
Well after a lot of hard work I got the UART working. The answer to the question lies within the serial_ioport_read() and serial_ioport_write() functions. These two methods are assigned as the callbacks QEMU invokes when data is read or written to the MemoryRegion for the serial device (which is initialized in serial_init() or serial_mm_init()). These functions do a bit of masking on the address (passed into the functions as addr) to determine which register is being referenced, then return the value from the SerialState struct corresponding to that register. It's surprisingly simple, but I guess everything seems simple once you've figured it out. The big turning point was the realization that QEMU effectively implements the serial device as a MemoryRegion with special functionality that is triggered on a memory operation.
Anyway, hope this helps someone in the future avoid the nightmare I went through. Cheers!
I have recently started coding in verilog. I have completed my first project, prototyping a MIPS 32 processor using 5 stage pipelining. Now my next task is to implement a single level cache hiearchy on the instruction set memory.
I have sucessfully implemented a 2-way set associative cache.
Previously I had declared the instruction set memory as a array of registers, so whenever I need to access the next instruction in IF stage, the data(instruction) gets instantaneously allotted to the register for further decoding (since blocking/non_blocking assignment is instantaneous from any memory location).
But now since I have a single level cache added on top of it, it takes a few more cycles for the cache FSM to work (like data searching, and replacement policies in case of cache miss). Max. delay is about 5 cycles when there is a cache miss.
Since my pipelined stage proceeds to the next stage within just a single cycle, hence whenever there is a cache miss, the cache fails to deliver the instruction before the pipeline stage moves to the next stage. So desired output is always wrong.
To counteract this , I have increased the clock of the cache by 5 times as compared the processor pipelined clock. This does do the work, since the cache clock is much faster, it need not to worry about the processor clock.
But is this workaround legit?? I mean i haven't heard of multiple clocks in a processor system. How does the processors in real world overcome this issue.
Yes ofc, there is an another way of using stall cycles in pipeline until the data is readily made available in cache (hit). But just wondering is making memory system more faster by increasing clock is justified??
P.S. I am newbie to computer architecture and verilog. I dont know about VLSI much. This is my first question ever, because whatever questions strikes, i get it readily available in webpages, but i cant find much details about this problem, so i am here.
I also asked my professor, she replied me to research more in this topic, bcs none of my colleague/ senior worked much on pipelined processors.
But is this workaround legit??
No, it isn't :P You're not only increasing the cache clock, but also apparently the memory clock. And if you can run your cache 5x faster and still make the timing constraints, that means you should clock your whole CPU 5x faster if you're aiming for max performance.
A classic 5-stage RISC pipeline assumes and is designed around single-cycle latency for cache hits (and simultaneous data and instruction cache access), but stalls on cache misses. (Data load/store address calculation happens in EX, and cache access in MEM, which is why that stage exists)
A stall is logically equivalent to inserting a NOP, so you can do that on cache miss. The program counter needs to not increment, but otherwise it should be a pretty local change.
If you had hardware performance counters, you'd maybe want to distinguish between real instructions vs. fake stall NOPs so you could count real instructions executed.
You'll need to implement pipeline interlocks for other stages that stall to wait for their inputs to be ready, e.g. a cache-miss load followed by an add that uses the result.
MIPS I had load-delay slots (you can't use the result of a load in the following instruction, because the MEM stage is after EX). So that ISA rule hides the 1 cycle latency of a cache hit without requiring the HW to detect the dependency and stall for it.
But a cache miss still had to be detected. Probably it stalled the whole pipeline whether there was a dependency or not. (Again, like inserting a NOP for the rest of the pipeline while holding on to the incoming instruction. Except this isn't the first stage, so it has to signal to the previous stage that it's stalling.)
Later versions of MIPS removed the load delay slot to avoid bloating code with NOPs when compilers couldn't fill the slot. Simple HW then had to detect the dependency and stall if needed, but smarter hardware probably tracked loads anyway so they could do hit under miss and so on. Not stalling the pipeline until an instruction actually tried to read a load result that wasn't ready.
MIPS = "Microprocessor without Interlocked Pipeline Stages" (i.e. no data-hazard detection). But it still had to stall for cache misses.
An alternate expansion for the acronym (which still fits MIPS II where the load delay slot as removed, requiring HW interlocks to detect that data hazard) would be "Minimally Interlocked Pipeline Stages" but apparently I made that up in my head, thanks #PaulClayton for catching that.
I'm having a bit of problems understanding how or if its possible to share a work load between a gpu and cpu. I have a large log file that I need to read each line then run about 5 million operations on(testing for various scenarios). My current approach has been to read a few hundred lines, add it to an array and then send it to each GPU, which is working fine but because there is so much work per line and so many lines it takes a long time. I noticed that while this is going on my CPU cores are basically doing nothing. I'm using EC2, so I have 2 quad core Xeon & 2 Tesla GPUs, one cpu core reads the file(running the main program) and the GPU's do the work so I'm wondering how or what can I do to involve the other 7 cores into the process?
I'm a bit confused at how to design a program to balance the tasks between GPU/CPU because they both would finish the jobs at different times so I couldn't just send it to them all at the same time. I thought about setting up a queue(I'm new to c, so not sure if this is possible yet) but then is there a way to know when a GPU job is completed(since I thought sending jobs to Cuda was asynchronous)? I kernel is very similar to a normal c function so converting it for cpu usage is not problem just balancing the work seems to be the issue. I went though 'Cuda by example' again but couldn't really find anything referring to this type of balancing.
Any suggestions would be great.
I think the key is to create a multithreaded app, following all the common practices for that, and have two types of worker threads. One that does work with the GPU and one that does work with the CPU. So basically, you will need a thread pool and a queue.
http://en.wikipedia.org/wiki/Thread_pool_pattern
The queue can be very simple. You can have one shared integer that is the index of the current row in the log file. When a thread is ready to retrieve more work, it locks that index, gets some number of lines from the log file, starting at the line designated by the index, then increases the index by the number of lines that it retrieved, and then unlocks.
When a worker thread is done with one chunk of the log file, it posts its results back to the main thread and gets another chunk (or exits if there are no more lines to process).
The app launches some combination of GPU and CPU worker threads to utilize all available GPUs and CPU cores.
One problem you may run into is that if the CPU is busy, performance of the GPUs may suffer, as slight delays in submitting new work or processing results from the GPUs are introduced. You may need to experiment with the number of threads and their affinity. For instance, you may need to reserve one CPU core for each GPU by manipulating thread affinities.
Since you say line-by-line may be you can split the jobs across 2 different process -
One CPU + GPU Process
One CPU process that utilized remaining 7 cores
You can start of each process with different offsets - like 1st process reads the lines 1-50, 101-150 etc while the 2nd one reads 51-100, 151-200 etc
This will avoid you the headache of optimizing CPU-GPU interaction
I have a VxWorks application running on ARM uC.
First let me summarize the application;
Application consists of a 3rd party stack and a gateway application.
We have implemented an operating system abstraction layer to support OS in-dependency.
The underlying stack has its own memory management&control facility which holds memory blocks in a doubly linked list.
For instance ; we don't directly perform malloc/new , free/delege .Instead we call OSA layer's routines and it gets the memory from OS and puts it in a list then returns this memory to application.(routines : XXAlloc , XXFree,XXReAlloc)
And when freeing the memory we again use XXFree.
In fact this block is a struct which has
-magic numbers indication the beginning and end of memory block
-size that user requested allocated
-size in reality due to alignment issue previous and next pointers
-pointer to piece of memory given back to application. link register that shows where in the application xxAlloc is called.
With this block structure stack can check if a block is corrupted or not.
Also we have pthread library which is ported from Linux that we use to
-create/terminate threads(currently there are 22 threads)
-synchronization objects(events,mutexes..)
There is main task called by taskSpawn and later this task created other threads.
this was a description of application and its VxWorks interface.
The problem is :
one of tasks suddenly gets destroyed by VxWorks giving no information about what's wrong.
I also have a jtag debugger and it hits the VxWorks taskDestoy() routine but call stack doesn't give any information neither PC or r14.
I'm suspicious of specific routine in code where huge xxAlloc is done but problem occurs
very sporadic giving no clue that I can map it to source code.
I think OS detects and exception and performs its handling silently.
any help would be great
regards
It resolved.
I did an isolated test. Allocated 20MB with malloc and memset with 0x55 and stopped thread of my application.
And I wrote another thread which checks my 20MB if any data else than 0x55 is written.
And quess what!! some other thread which belongs other components in CPU (someone else developed them) write my allocated space.
Thanks 4 your help
If your task exits, taskDestroy() is called. If you are suspicious of huge xxAlloc, verify that the allocation code is not calling exit() when memory is exhausted. I've been bitten by this behavior in a third party OSAL before.
Sounds like you are debugging after integration; this can be a hell of a job.
I suggest breaking the problem into smaller pieces.
Process
1) you can get more insight by instrumenting the code and/or using VxWorks intrumentation (depending on which version). This allows you to get more visibility in what happens. Be sure to log everything to a file, so you move back in time from the point where the task ends. Instrumentation is a worthwile investment as it will be handy in more occasions. Interesting hooks in VxWorks: Taskhooklib
2) memory allocation/deallocation is very fundamental functionality. It would be my first candidate for thorough (unit) testing in a well-defined multi-thread environment. If you have done this and no errors are found, I'd first start to look why the tas has ended.
other possible causes
A task will also end when the work is done.. so it may be a return caused by a not-so-endless loop. Especially if it is always the same task, this would be my guess.
And some versions of VxWorks have MMU support which must be considered.