How to diag imprecise bus fault after config of priority bit allocation, Cortex M3 STM32F10x w uC/OS-III - exception

I have an issue in an app written for the ST Microelectronics STM32F103 (ARM Cortex-M3 r1p1). RTOS is uC/OS-III; dev environment is IAR EWARM v. 6.44; it also uses the ST Standard Peripheral Library v. 1.0.1.
The app is not new; it's been in development and in the field for at least a year. It makes use of two UARTs, I2C, and one or two timers. Recently I decided to review interrupt priority assignments, and I rearranged priorities as part of the review (things seemed to work just fine).
I discovered that there was no explicit allocation of group and sub-priority bits in the initialization code, including the RTOS, and so to make the app consistent with another app (same product, different processor) and with the new priority scheme, I added a call to NVIC_PriorityGroupConfig(), passing in NVIC_PriorityGroup_2. This sets the PRIGROUP value in the Application Interrupt and Reset Control Register (AIRCR) to 5, allocating 2 bits for group (preemption) priority and 2 bits for subpriority.
After doing this, I get an imprecise bus fault exception on execution, not immediately but very quickly thereafter. (More on where I suspect it occurs in a moment.) Since it's imprecise (BFSR.IMPRECISERR asserted), there's nothing of use in BFAR (BFSR.BFARVALID clear).
The STM32F family group implements 4 bits of priority. While I've not found this mentioned explicitly anywhere, it's apparently the most significant nybble of the priority. This assumption seems to be validated by the PRIGROUP table given in documentation (p. 134, STM32F10xxx/20xxx/21xxx/L1xxx Cortex-M3 Programming Manual (Doc 15491, Rev 5), sec. 4.4.5, Application interrupt and control register (SCB_AIRCR), Table 45, Priority grouping, p. 134).
In the ARM scheme, priority values comprise some number of group or preemption priority bits and some number of subpriority bits. Group priority bits are upper bits; subpriority are lower. The 3-bit AIRCR.PRIGROUP value controls how bit allocation for each are defined. PRIGROUP = 0 configures 7 bits of group priority and 1 bit of subpriority; PRIGROUP = 7 configures 0 bits of group priority and 8 bits of subpriority (thus priorities are all subpriority, and no preemption occurs for exceptions with settable priorities).
The reset value of AIRCR.PRIGROUP is defined to be 0.
For the STM32F10x, since only the upper 4 bits are implemented, it seems to follow that PRIGROUP = 0, 1, 2, 3 should all be equivalent, since they all correspond to >= 4 bits of group priority.
Given that assumption, I also tried calling NVIC_PriorityGroupConfig() with a value of NVIC_PriorityGroup_4, which corresponds to a PRIGROUP value of 3 (4 bits group priority, no subpriority).
This change also results in the bus fault exception.
Unfortunately, the STM32F103 is, I believe, r1p1, and so does not implement the Auxiliary Control Register (ACTLR; introduced in r2p0), so I can't try out the DISDEFWBUF bit (disables use of the write buffer during default memory map accesses, making all bus faults precise at the expense of some performance reduction).
I'm almost certain that the bus fault occurs in an ISR, and most likely in a UART ISR. I've set a breakpoint at a particular place in code, started the app, and had the bus fault before execution hit the breakpoint; however, if I step through code in the debugger, I can get to and past that breakpoint's location, and if I allow it to execute from there, I'll see the bus fault some small amount of time after I continue.
The next step will be to attempt to pin down what ISR is generating the bus fault, so that I can instrument it and/or attempt to catch its invocation and step through it.
So my questions are:
1) Anyone have any suggestions as to how to go about identifying the origin of imprecise bus fault exceptions more intelligently?
2) Why would setting PRIGROUP = 3 change the behavior of the system when PRIGROUP = 0 is the reset default? (PRIGROUP=0 means 7 bits group, 1 bit sub priority; PRIGROUP=3 means 4 bits group, 4 bits sub priority; STM32F10x only implements upper 4 bits of priority.)
Many, many thanks to everyone in advance for any insight or non-NULL pointers!
(And of course if I figure it out beforehand, I'll update this post with any information that might be useful to others encountering the same sort of scenario.)

Even if BFAR is not valid, you can still read other related registers within your bus-fault ISR:
void HardFault_Handler_C(unsigned int* hardfault_args)
{
printf("R0 = 0x%.8X\r\n",hardfault_args[0]);
printf("R1 = 0x%.8X\r\n",hardfault_args[1]);
printf("R2 = 0x%.8X\r\n",hardfault_args[2]);
printf("R3 = 0x%.8X\r\n",hardfault_args[3]);
printf("R12 = 0x%.8X\r\n",hardfault_args[4]);
printf("LR = 0x%.8X\r\n",hardfault_args[5]);
printf("PC = 0x%.8X\r\n",hardfault_args[6]);
printf("PSR = 0x%.8X\r\n",hardfault_args[7]);
printf("BFAR = 0x%.8X\r\n",*(unsigned int*)0xE000ED38);
printf("CFSR = 0x%.8X\r\n",*(unsigned int*)0xE000ED28);
printf("HFSR = 0x%.8X\r\n",*(unsigned int*)0xE000ED2C);
printf("DFSR = 0x%.8X\r\n",*(unsigned int*)0xE000ED30);
printf("AFSR = 0x%.8X\r\n",*(unsigned int*)0xE000ED3C);
printf("SHCSR = 0x%.8X\r\n",SCB->SHCSR);
while (1);
}
If you can't use printf at the point in the execution when this specific Hard-Fault interrupt occurs, then save all the above data in a global buffer instead, so you can view it after reaching the while (1).
Here is the complete description of how to connect this ISR to the interrupt vector (although, as I understand from your question, you already have it implemented):
Jumping from one firmware to another in MCU internal FLASH
You might be able to find additional information on top of what you already know at:
http://www.keil.com/appnotes/files/apnt209.pdf

Related

Do __shfl_xx_sync() intrinsics with mask need an additional __syncwarp()?

Do __shfl_xx_sync() instructions, where only some lanes participate, need an additional __syncwarp() instruction, or is setting a mask enough?
I cannot provide a working minimal example, as it is very long and confidential code and the error appeared only in certain run/build configurations.
The code looks basically like the following:
if (threadIdx.x >= 30) {
temp.x = __shfl_up_sync(0xC0000000, x, 1);
temp.y = __shfl_up_sync(0xC0000000, y, 1);
}
// __syncwarp();
__shfl_up_sync(0xffffffff, w, 1, 32);
Release builds worked fine; with debug builds lanes 30 and 31 waited (according to debugger and SASS) at a different sync instruction than the other lanes.
When I introduced __syncwarp() also debug builds run through. And I hope this problem is definitely fixed!?!
I am using a mask in the first two shuffle instructions indicating that only lanes 30 and 31 participate. What happens, if the scheduler decides that lanes 0 to 29 are executed first and goes to the second shuffle instruction (with all participating lanes)? Then those shuffle instructions wait for the lanes 30 and 31. Those threads then get to the upper shuffle instructions. Can the shuffles be distinguished?
If the __syncwarp() is needed: Why would it react differently then the shuffle instruction with mask 0xffffffff itself?
Because it is of other type? (shuffle sync instead of normal sync?) Or was it by accident that the program worked in this way?
(The __syncwarp() intrinsic is probably useful here anyway (for performance reasons), as the threads converge at that point.)
If __syncwarp() is not enough: How to make sure the kernel does not hang? Is there generally another recommended way than __syncwarp()?
I am running this on Turing RTX 2060 Mobile (and debugged is with Visual Studio).
No, you should not need a __syncwarp() here. CUDA went from e.g. __shfl_up() to __shfl_up_sync() to avoid this. I think the problem is that you are trying to shuffle up data from a thread that is not participating in the call, i.e. thread 30 is trying to get data from thread 29, so thread 29 has to participate.
Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.
from the docs. Although this explanation is still unsatisfactory, as you seem to get a deadlock instead of an undefined value. But maybe this is wanted behavior for a debug build?
That being said, I'm not quite sure how to do this elegantly, because just including thread 29 in the conditional and mask will only shift the problem to 29 trying to get data from 28. In the examples given in the documentation, they always do the intrinsic with all threads and then conditionally use the results.
My best guess is that you want thread 29 to participate, but with a delta of 0. I have not found anything saying that delta needs to be the same across threads.
You might also want to use __ballot_sync() to retrieve a mask as can be seen in Listing 3 of this blogpost to avoid bugs from manually specifying the mask, which needs to be changed whenever the conditional is changed.

Looking for a _one byte_ invalid opcode with x86

I need an invalid opcode with x86 (not x64!) that's exactly one byte in length to overwrite some code in a foreign process. Currently I'm using INT3 (0xCC) but it would be nicer to trap an invalid opcode separately since the foreign process contains a lot of valid INT3.
According to http://ref.x86asm.net/coder32.html, there aren't any in 32-bit mode guaranteed to #UD. Anything that wasn't nailed down has been reused as building material for new extensions.
The ones that exist in 64-bit mode are reserved and not guaranteed to fault on future CPUs; only ud2 is truly guaranteed future-proof. Assuming x86-64 lasts long enough, likely some vendor will make use of that 64-bit-only coding space and stop wasting code-size to also cater to increasingly obsolete 32-bit mode.
If you don't need #UD, you can raise #GP(0) with some privileged instructions in user-space, assuming you're never going to be running in kernel mode.
F4 hlt will always #GP(0) in user-space, not enabled by IOPL, only true CPL=0. (Or #UD if used with a lock prefix). Even if it somehow gets executed in a kernel context, it just stops and waits for the next interrupt, so typically no effect on correctness unless executed with interrupts disabled. (In which case you're stuck until the next NMI).
A similar but worse option is FB sti. But it can execute successfully in a program that's used Linux iopl(), like an X11 server. Unless interrupts were supposed to be disabled, though, that's still not going to lock up your system, it just won't trigger the exception you were looking for. (Unlike cli which could get that CPU stuck, or in al, dx which could do wild IO and even be allowed by ioperm not just iopl, depending on what value is in DX.)
Depending what comes next in memory, 9A callf ptr16:32 might fault on trying to load an invalid value into CS. That value would come from the 2 bytes of machine code 5 and 6 bytes after this one (i.e. after a 32-bit new EIP, since ptr16:32 is stored little-endian). Unlike call rel32 or whatever, it may fault before actually pushing anything and overwriting the current CS:EIP. (But if not, in theory your debugger could simulate popping that far-return address back into CS:EIP after catching the fault.)
Just to be clear, I'm suggesting overwriting a byte with 9A, and leaving the later bytes of machine code unmodified, after checking that the bytes that would be the new CS value are in fact invalid. e.g. by making sure a far call to that address segfaults. Or if this is near the end of a page, and the next is unmapped, it can #PF.
The F0 lock prefix faults with #UD if used on things other than a memory-destination RMW operation, so it can also work if later context would decode as any other instruction. But you can't always use it; you need to check that you aren't creating a valid atomic RMW instruction. e.g. if the ModRM byte was 00 or 01, replacing the opcode with a lock prefix creates a memory-destination add.
#ecm points out that f1 on some CPUs is icebp / int1, but on other CPUs where it isn't, it's undefined but doesn't raise #UD. (http://ref.x86asm.net/coder32.html#xF1)
If the following byte is 0, D4 00 aam 0 is guaranteed to #DE (divide exception). But any other value does immediate 8-bit division of AL.
Depending what byte comes next, CD int n can be used. But not for all following bytes, e.g. int 0x80 won't fault under Linux (unless your kernel is built without CONFIG_IA32_EMULATION). And you might not want some of the other random interrupt numbers. e.g. CD 03 int 3 is pretty much like CC int3.

Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory

Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:
for(int i = 0; i < big_number; i++) {
auto thread_idx = determine_thread_index_into_its_own_array();
T value = calculate_value();
write_to_own_shmem(thread_idx, value);
}
Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.
If sizeof(T) == 4, then this is is easy-peasy: Just place all of thread i's data in shared memory addresses i, 32+i, 64+i, 96+i etc. This puts all of i's data in the same bank, that's also distinct from the other lane's banks. Great.
But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?
Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.
tl;dr: Use the same kind of interleaving as for 32-bit values.
On later-than-Kepler micro-architectures (up to Volta), the best we could theoretically get is 2 shared memory transactions for a full warp reading a single 64-bit value (as a single transaction provides 32 bits to each lane at most).
This is is achievable in practice by the analogous placement pattern OP described for 32-bit data. That is, for T* arr, have lane i read the idx'th element as T[idx + i * 32]. This will compile so that two transactions occur:
The lower 16 lanes obtain their data from the first 32*4 bytes in T (utilizing all banks)
The higher 16 obtain their data from the successive 32*4 bytes in T (utilizing all banks)
So the GPU is smarter/more flexible than trying to fetch 4 bytes for each lane separately. That means it can do better than the simplistic "break up T into halves" idea the earlier answer proposed.
(This answer is based on #RobertCrovella's comments.)
On Kepler GPUs, this had a simple solution: Just change the bank size! Kepler supported setting the shared memory bank size to 8 instead of 4, dynamically. But alas, that feature is not available in later microarchitectures (e.g. Maxwell, Pascal).
Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: Reduce the 64-bit case to the 32-bit case.
Instead of each thread storing N values of type T, it stores 2N values, each consecutive pair being the low and the high 32-bits of a T.
To access a 64-bit values, 2 half-T accesses are made, and the T is composed with something like `
uint64_t joined =
reinterpret_cast<uint32_t&>(&upper_half) << 32 +
reinterpret_cast<uint32_t&>(&lower_half);
auto& my_t_value = reinterpret_cast<T&>(&joined);
and the same in reverse when writing.
As comments suggest, it is better to make 64-bit access, as described in this answer.

Monte Carlo with rates, system simulation with CUDA C++

So I am trying to simulate a 1-D physical model named Tasep.
I wrote a code to simulate this system in c++, but I definitely need a performance boost.
The model is very simple ( c++ code below ) - an array of 1's and 0's. 1 represent a particle and 0 is no-particle, meaning empty. A particle moves one element to the right, at a rate 1, if that element is empty. A particle at the last location will disappear at a rate beta ( say 0.3 ). Finally, if the first location is empty a particle will appear there, at a rate alpha.
One threaded is easy, I just pick an element at random, and act with probability 1 / alpha / beta, as written above. But this can take a lot of time.
So I tried to do a similar thing with many threads, using the GPU, and that raised a lot of questions:
Is using the GPU and CUDA at all good idea for such a thing?
How many threads should I have? I can have a thread for each site ( 10E+6 ), should I?
How do I synchronize the access to memory between different threads? I used atomic operations so far.
What is the right way to generate random data? If I use a million threads is it ok to have a random generator for each?
How do I take care of the rates?
I am very new to CUDA. I managed to run code from CUDA samples and some tutorials. Although I have some code of the above ( still gives strange result though ), I do not put it here, because I think the questions are more general.
So here is the c++ one threaded version of it:
int Tasep()
{
const int L = 750000;
// rates
int alpha = 330;
int beta = 300;
int ProbabilityNormalizer = 1000;
bool system[L];
int pos = 0;
InitArray(system); // init to 0's and 1's
/* Loop */
for (int j = 0; j < 10*L*L; j++)
{
unsigned long randomNumber = xorshf96();
pos = (randomNumber % (L)); // Pick Random location in the the array
if (pos == 0 && system[0] == 0) // First site and empty
system[0] = (alpha > (xorshf96() % ProbabilityNormalizer)); // Insert a particle with chance alpha
else if (pos == L - 1) // last site
system[L - 1] = system[L - 1] && (beta < (xorshf96() % ProbabilityNormalizer)); // Remove a particle if exists with chance beta
else if (system[pos] && !system[pos + 1]) // If current location have a particle and the next one is empty - Jump right
{
system[pos] = false;
system[pos + 1] = true;
}
if ((j % 1000) == 0) // Just do some Loggingg
Log(system, j);
}
getchar();
return 0;
}
I would be truly grateful for whoever is willing to help and give his/her advice.
I think that your goal is to perform something called Monte Carlo Simulations.
But I have failed to fully understand your main objective (i.e. get a frequency, or average power lost, etc.)
Question 01
Since you asked about random data, I do believe you can have multiple random seeds (maybe one for each thread), I would advise you to generate the seed in the GPU using any pseudo random generator (you can use even the same as CPU), store the seeds in GPU global memory and launch as many threads you can using dynamic parallelism.
So, yes CUDA is a suitable approach, but keep in your mind the balance between time that you will require to learn and how much time you will need to get the result from your current code.
If you will take use this knowledge in the future, learn CUDA maybe worth, or if you can escalate your code in many GPUs and it is taking too much time in CPU and you will need to solve this equation very often it worth too. Looks like that you are close, but if it is a simple one time result, I would advise you to let the CPU solve it, because probably, from my experience, you will take more time learning CUDA than the CPU will take to solve it (IMHO).
Question 02
The number of threads is very usual question for rookies. The answer is very dependent of your project, but taking in your code as an insight, I would take as many I can, using every thread with a different seed.
My suggestion is to use registers are what you call "sites" (be aware that are strong limitations) and then run multiples loops to evaluate your particle, in the very same idea of a car tire a bad road (data in SMEM), so your L is limited to 255 per loop (avoid spill at your cost to your project, and less registers means more warps per block). To create perturbation, I would load vectors in the shared memory, one for alpha (short), one for beta (short) (I do assume different distributions), one "exist or not particle" in the next site (char), and another two to combine as pseudo generator source with threadID, blockID, and some current time info (to help you to pick the initial alpha, beta and exist or not) so u can reuse this rates for every thread in the block, and since the data do not change (only the reading position change) you have to sync only once after reading, also you can "random pick the perturbation position and reuse the data. The initial values can be loaded from global memory and "refreshed" after an specific number of loops to hide the loading latency. In short, you will reuse the same data in shared multiple times, but the values selected for every thread change at every interaction due to the pseudo random value. Taking in account that you are talking about large numbers and you can load different data in every block, the pseudo random algorithm should be good enough. Also, you can even use the result stored in the gpu from previous runs as random source, flip one variable and do some bit operations, so u can use every bit as a particle.
Question 03
For your specific project I would strongly recommend to avoid thread cooperation and make these completely independent. But, you can use shuffle inside the same warp, with no high cost.
Question 04
It is hard to generate truly random data, but you should worry about by how often last your period (since any generator has a period of random and them repeats). I would suggest you to use a single generator which can work in parallel to your kernel and use it feed your kernels (you can use dynamic paralelism). In your case since you want some random you should not worry a lot with consistency. I gave an example of pseudo random data use in the previous question, that may assist. Keep in your mind that there is no real random generator, but there are alternatives as internet bits for example.
Question 05
Already explained in the Question 03, but keep in your mind that you do not need a full sequence of values, only a enough part to use in multiple forms, to give enough time to your kernel just process and then you can refresh you sequence, if you guarantee to not feed block with the same sequence it will be very hard to fall into patterns.
Hope I have help, I’m working with CUDA for a bit more than a year, started just like you, and still every week I do improve my code, now it is almost good enough. Now I see how it perfectly fit my statistical challenge: cluster random things.
Good luck!

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered
memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the data is observed being
written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the read instructions appear in
the program for instructions that are independent of each other
However, do we have a guarantee that the (dependent) memory operations as seen from the single thread are actually consistent? If I do - say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is what optimizations (done either by compiler or hardware) are allowed?
It could happen --- for example --- that store instruction is non-blocking and the load instruction that follows somehow is managed by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide simply stating, that you cannot predict in which order the threads is executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
Here the point being, that if you have several threads dooing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before #10.
Take an example:
__device__ void sum_all(float *x, float *result, int size N){
x[threadId.x] = threadId.x;
result[threadId.x] = 0;
for(int i = 0; i < N; i++)
result[threadId.x] += x[threadID.x];
}
Here we have some dumb function, which SHOULD fill a shared array (x) with the numbers from m ... n (read from one number to another number), and then sum up the numbers already put into the array and store the result in another array.
Given that you your lowest indexed thread is enumerated thread #0, you would expect that the first time your code runs this code x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
I am not sure, if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically what you are asking is, if running your code on a GFX will change the functionality of your code.
The cores of a GPU are generally the same, as that of a CPU, just with less control-arithmetics, a smaller instructionset and typically only supporting single-precision.
In a CUDA-GPU there is 1 program counter for each Warp (section of 32 synchronous cores). Like a CPU, the program counter increases by magnitude of one address element after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this can not be changed.
Branches and jumps can only be introduced by the software running on the core, and hence is determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler
So in short - Your code will always be executed in the order it is ordered in the memory, no matter if it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understood you're basically asking whether memory dependencies and alias analysis information are being respected in the CUDA compiler.
The answer to that question is, assuming that the CUDA compiler is free of bugs, yes because as Robert noted the CUDA compiler uses LLVM under the hood and two basic modules (which, at the moment, I really don't think they could be excluded by the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use live-analysis on variables (even out of the block scope) to avoid dangerous optimizations (e.g. you can't write in a live variable before its next read, the data may still be useful).
I don't know the compiler internals but assuming (as any other reasonably trusted compiler) that it will do its best to be bug-free, the analysis that take place in there should really not bother you at all and assure you that at least in theory what you just presented as an example (i.e. the dependent-load faster than the store) cannot happen.
What guarantee you that? Nothing but the fact that the company is giving a compiler to use, and there are disclaimers in case it doesn't for exceptional cases :)
Also: aside from the compiler topic, the instruction execution is also dependent on the hardware specification. In this case, a SIMT hardware instruction issuing unit
cfr. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information