TXM_MODULE_MANAGER_16_MPU for STM32H7 - ThreadX

According to application note AN4838, page 12, the STM32H7 has only 8 MPU regions. However, the project description for the STM32H747I contains the following statement:
The TXM_MODULE_MANAGER_16_MPU is a Preprocessor define that should be
added in both C and Assembly preprocessor define list to allow the
application on the stm32H7xx family to work properly.
I searched for the symbol to see if it really refers to an MPU configuration of 16 regions instead of 8, and I found the following in tx_thread_schedule:
config_mpu:
LDM r0!,{r2-r9} // Load MPU regions 0-3
STM r1,{r2-r9} // Store MPU regions 0-3
LDM r0!,{r2-r9} // Load MPU regions 4-7
STM r1,{r2-r9} // Store MPU regions 4-7
#ifdef TXM_MODULE_MANAGER_16_MPU
LDM r0!,{r2-r9} // Load MPU regions 8-11
STM r1,{r2-r9} // Store MPU regions 8-11
// Regions 12-15 are reserved for the user to define.
LDM r0,{r2-r9} // Load MPU regions 12-15
STM r1,{r2-r9} // Store MPU regions 12-15
#endif
I have tried removing the symbol in a project for the STM32H735, but that gives unexpected behaviour. Does this mean the AN is wrong and we can configure 16 regions, which would also mean we can configure 128 (16*8) subregions?
EDIT: after the response pointing to the linked manual, p. 254:
Removing TXM_MODULE_MANAGER_16_MPU (the define indicating the MPU has 16 regions) results in unexpected behavior,
and MPU_TYPE.DREGION = 0x10, which also indicates 16 regions.
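For reference, the region count can be read back at runtime from MPU_TYPE.DREGION (a minimal sketch, assuming the CMSIS core headers are available in the project):

#include "stm32h7xx.h"  // CMSIS device header (assumed project setup)

// DREGION (bits 15:8 of MPU_TYPE) gives the number of supported MPU data regions.
static uint32_t mpu_region_count(void)
{
    return (MPU->TYPE & MPU_TYPE_DREGION_Msk) >> MPU_TYPE_DREGION_Pos;
}

On parts where DREGION reads 0x10, the MPU implements 16 regions.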

The app note is unclear
Edit: The STM32H7xx chips should all have 8-region MPUs according to the programming manual https://www.st.com/resource/en/programming_manual/pm0253-stm32f7-series-and-stm32h7-series-cortexm7-processor-programming-manual-stmicroelectronics.pdf

Related

8051 External /WR signal not generated when UART bootloader is used after reset

Goal of the project: To connect an LCD to an 8051 as an external memory-mapped I/O device.
Problem: Given the following details, my 8051 controller just does not generate the external /RD and /WR strobes required for the rest of the code to work.
Previous work: I used 8051 port 3 pins to generate the EN, R/W and RS signals and got it to work. Therefore, I know that my command sequence is working fine. However, this was a really inefficient way of using the LCD, because the enable pulse was generated by setting and resetting a port pin. I wish to connect the LCD using the external WR/RD signals, mapping it as a memory-mapped I/O device. I have worked through the timing diagrams, and the overall block diagram is attached here. As you can see (in the block diagram), the R/W line of the LCD is activated using the most significant 6 pins of port 2, so that the LCD gets activated only at the right memory addresses. This decode (implemented in an SPLD) also provides the delay required at the LCD to meet the minimum setup time after Port 2 pins 0 and 1 are used to drive the R/W and RS inputs of the LCD.
Additional hardware info: I have attached a spice diagram to show how the rest of my 8051 is connected. The one thing that is not included there is this: "I use a momentary pushbutton and pull-down resistor for /PSEN, and hold that button when coming out of reset in order to force bootloader operation; then, after the bootloader has started, I release that button to eliminate drive fight issues on the /PSEN line. I use a header/jumper for the /EA input to ensure it is high. Note that if you use these hardware conditions to enter the bootloader when you come out of reset, then the Atmel bootloader is entered regardless of the values of BLJB, BSB, and SBV."
Software used: I am using PAULMON2 to test my code. Programming is done using the Flip utility (Flip 3.4.7) through the serial port. A serial emulator program (TeraTerm) is used to communicate with the microcontroller. The microcontroller first executes the PAULMON2 code, along with the extra commands that have been programmed into it, before the user code at location 0x2000. An extra command allows the user to jump to this code using the 'J' command and then giving the memory address 0x2000. This calls the current program and executes it. This is where my code resides and executes from.
The addresses used to map LCD are the following:
LCD_INSTR_WR: 0xA8FF ---> Used to write commands to LCD controller.
This includes all initialization and setup and management commands.
LCD_INSTR_RD: 0xA9FF ---> Used to read command. Done only to read the busy
flag or the current address counter. This is valid only for a single
instruction on the LCD.
LCD_DATA_WR: 0xAAFF ---> Used to write Data to the current address which has
been set either in DDRAM or CGRAM given the LCD_INSTR_WR above.
LCD_DATA_RD: 0xABFF ----> Used to read Data from the current address which
has been set either in DDRAM or CGRAM given the LCD_INSTR_WR above.
The C code snippet I use to write to external memory:
//Global variables
volatile unsigned char xdata *LCD_INSTR_WR = (char xdata *) 0xA8FF;
volatile unsigned char xdata *LCD_INSTR_RD = (char xdata *) 0xA9FF;
volatile unsigned char xdata *LCD_DATA_WR = (char xdata *) 0xAAFF;
volatile unsigned char xdata *LCD_DATA_RD = (char xdata *) 0xABFF;
/// More code
//Write command example
lcdbusywait();
*LCD_DATA_WR = cc;
Earlier tests done to figure out the problem:
I have tried writing to the memory locations above 0x2000 using the PAULMON2 memory-edit instructions, and they write those memory locations all right. Even the /WR strobe is generated in this case, as observed (but I have not properly measured/counted the accesses and /WR edge changes).
I have used the logic analyser to confirm that the address (and consequently RS and RW) and data (the 0x30 command in the beginning) are coming to the ports as expected. ALE is being generated.
I have verified that AUXR register bit EXTRAM is set (AUXR = 0x0E). Also, since EXTRAM is set by default, I tried to remove my initialization code for AUXR completely and that didn’t work either.
I was not sure that the C code I have written for the XRAM address accesses is correct. However, I went on to check the .asm file and (unless I am neglecting something very minute), the generated assembly does assign the value 0x30 as immediate data to register A and uses a "MOVX @DPTR,A" instruction to write this value to external memory.
Finally, this is my first post at Stack Overflow, so the formatting may be off, and I do realize this is an extremely long post. Apologies for that. Let me know if you need to see the code files, the compiled hex file, or other details. All your help is deeply appreciated.
volatile unsigned char xdata *LCD_INSTR_WR = (char xdata *) 0xA8FF;
I guess LCD_INSTR_WR should hold the address value 0xA8FF.
You can try by using
#define LCD_INSTR_WR XBYTE[0xA8FF]
and then
LCD_INSTR_WR = cc;
or
char xdata LCD_INSTR_WR _at_ 0xA8FF; // This declares LCD_INSTR_WR at location 0xA8FF
and then
LCD_INSTR_WR = cc;
You may need to look into the microcontroller's datasheet to see how to configure external memory.
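For completeness, here is a minimal sketch of the XBYTE approach under Keil C51 (XBYTE comes from <absacc.h>; lcdbusywait() is the busy-flag poll from the question):

#include <absacc.h>  // Keil C51: XBYTE macro for absolute xdata access

#define LCD_INSTR_WR XBYTE[0xA8FF]  // command-write address from the question
#define LCD_DATA_WR  XBYTE[0xAAFF]  // data-write address from the question

extern void lcdbusywait(void);      // busy-flag poll, defined in the question's code

void lcd_write_data(unsigned char cc)
{
    lcdbusywait();     // wait until the LCD clears its busy flag
    LCD_DATA_WR = cc;  // compiles to MOVX @DPTR,A and should pulse /WR
}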

How to diagnose an imprecise bus fault after configuring priority bit allocation, Cortex-M3 STM32F10x with uC/OS-III

I have an issue in an app written for the ST Microelectronics STM32F103 (ARM Cortex-M3 r1p1). RTOS is uC/OS-III; dev environment is IAR EWARM v. 6.44; it also uses the ST Standard Peripheral Library v. 1.0.1.
The app is not new; it's been in development and in the field for at least a year. It makes use of two UARTs, I2C, and one or two timers. Recently I decided to review interrupt priority assignments, and I rearranged priorities as part of the review (things seemed to work just fine).
I discovered that there was no explicit allocation of group and sub-priority bits in the initialization code, including the RTOS, and so to make the app consistent with another app (same product, different processor) and with the new priority scheme, I added a call to NVIC_PriorityGroupConfig(), passing in NVIC_PriorityGroup_2. This sets the PRIGROUP value in the Application Interrupt and Reset Control Register (AIRCR) to 5, allocating 2 bits for group (preemption) priority and 2 bits for subpriority.
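For reference, the added configuration looked roughly like this (a sketch assuming the SPL v1.0.x API; the channel and priority values are illustrative, not my actual assignments):

#include "stm32f10x.h"
#include "misc.h"  // SPL NVIC helpers

void nvic_config(void)
{
    NVIC_InitTypeDef nvic;

    // PRIGROUP = 5: 2 bits of group (preemption) priority, 2 bits of subpriority
    NVIC_PriorityGroupConfig(NVIC_PriorityGroup_2);

    nvic.NVIC_IRQChannel = USART1_IRQn;          // illustrative channel
    nvic.NVIC_IRQChannelPreemptionPriority = 1;  // 0..3 with 2 group bits
    nvic.NVIC_IRQChannelSubPriority = 0;         // 0..3 with 2 sub bits
    nvic.NVIC_IRQChannelCmd = ENABLE;
    NVIC_Init(&nvic);
}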
After doing this, I get an imprecise bus fault exception on execution, not immediately but very quickly thereafter. (More on where I suspect it occurs in a moment.) Since it's imprecise (BFSR.IMPRECISERR asserted), there's nothing of use in BFAR (BFSR.BFARVALID clear).
The STM32F family implements 4 bits of priority. While I've not found this mentioned explicitly anywhere, it's apparently the most significant nybble of the priority byte. This assumption seems to be validated by the PRIGROUP table given in the documentation (STM32F10xxx/20xxx/21xxx/L1xxx Cortex-M3 Programming Manual, Doc 15491, Rev 5, sec. 4.4.5, Application interrupt and reset control register (SCB_AIRCR), Table 45, Priority grouping, p. 134).
In the ARM scheme, priority values comprise some number of group or preemption priority bits and some number of subpriority bits. Group priority bits are upper bits; subpriority are lower. The 3-bit AIRCR.PRIGROUP value controls how bit allocation for each are defined. PRIGROUP = 0 configures 7 bits of group priority and 1 bit of subpriority; PRIGROUP = 7 configures 0 bits of group priority and 8 bits of subpriority (thus priorities are all subpriority, and no preemption occurs for exceptions with settable priorities).
The reset value of AIRCR.PRIGROUP is defined to be 0.
For the STM32F10x, since only the upper 4 bits are implemented, it seems to follow that PRIGROUP = 0, 1, 2, 3 should all be equivalent, since they all correspond to >= 4 bits of group priority.
Given that assumption, I also tried calling NVIC_PriorityGroupConfig() with a value of NVIC_PriorityGroup_4, which corresponds to a PRIGROUP value of 3 (4 bits group priority, no subpriority).
This change also results in the bus fault exception.
Unfortunately, the STM32F103 is, I believe, r1p1, and so does not implement the Auxiliary Control Register (ACTLR; introduced in r2p0), so I can't try out the DISDEFWBUF bit (disables use of the write buffer during default memory map accesses, making all bus faults precise at the expense of some performance reduction).
I'm almost certain that the bus fault occurs in an ISR, and most likely in a UART ISR. I've set a breakpoint at a particular place in code, started the app, and had the bus fault before execution hit the breakpoint; however, if I step through code in the debugger, I can get to and past that breakpoint's location, and if I allow it to execute from there, I'll see the bus fault some small amount of time after I continue.
The next step will be to attempt to pin down what ISR is generating the bus fault, so that I can instrument it and/or attempt to catch its invocation and step through it.
So my questions are:
1) Anyone have any suggestions as to how to go about identifying the origin of imprecise bus fault exceptions more intelligently?
2) Why would setting PRIGROUP = 3 change the behavior of the system when PRIGROUP = 0 is the reset default? (PRIGROUP=0 means 7 bits group, 1 bit sub priority; PRIGROUP=3 means 4 bits group, 4 bits sub priority; STM32F10x only implements upper 4 bits of priority.)
Many, many thanks to everyone in advance for any insight or non-NULL pointers!
(And of course if I figure it out beforehand, I'll update this post with any information that might be useful to others encountering the same sort of scenario.)
Even if BFAR is not valid, you can still read the other related registers within your fault handler:
void HardFault_Handler_C(unsigned int* hardfault_args)
{
    // hardfault_args points at the exception frame the core stacked on fault entry:
    // R0, R1, R2, R3, R12, LR, PC, xPSR.
    printf("R0    = 0x%.8X\r\n", hardfault_args[0]);
    printf("R1    = 0x%.8X\r\n", hardfault_args[1]);
    printf("R2    = 0x%.8X\r\n", hardfault_args[2]);
    printf("R3    = 0x%.8X\r\n", hardfault_args[3]);
    printf("R12   = 0x%.8X\r\n", hardfault_args[4]);
    printf("LR    = 0x%.8X\r\n", hardfault_args[5]);
    printf("PC    = 0x%.8X\r\n", hardfault_args[6]);
    printf("PSR   = 0x%.8X\r\n", hardfault_args[7]);
    // Fault status/address registers, read directly from the System Control Block:
    printf("BFAR  = 0x%.8X\r\n", *(unsigned int*)0xE000ED38);
    printf("CFSR  = 0x%.8X\r\n", *(unsigned int*)0xE000ED28);
    printf("HFSR  = 0x%.8X\r\n", *(unsigned int*)0xE000ED2C);
    printf("DFSR  = 0x%.8X\r\n", *(unsigned int*)0xE000ED30);
    printf("AFSR  = 0x%.8X\r\n", *(unsigned int*)0xE000ED3C);
    printf("SHCSR = 0x%.8X\r\n", SCB->SHCSR);
    while (1);
}
If you can't use printf at the point in the execution when this specific Hard-Fault interrupt occurs, then save all the above data in a global buffer instead, so you can view it after reaching the while (1).
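A minimal sketch of that variant (the buffer name and layout are illustrative):

volatile unsigned int fault_snapshot[13];

void HardFault_Handler_C(unsigned int* hardfault_args)
{
    int i;
    for (i = 0; i < 8; i++)
        fault_snapshot[i] = hardfault_args[i];                 // R0-R3, R12, LR, PC, xPSR
    fault_snapshot[8]  = *(volatile unsigned int*)0xE000ED38;  // BFAR
    fault_snapshot[9]  = *(volatile unsigned int*)0xE000ED28;  // CFSR
    fault_snapshot[10] = *(volatile unsigned int*)0xE000ED2C;  // HFSR
    fault_snapshot[11] = *(volatile unsigned int*)0xE000ED30;  // DFSR
    fault_snapshot[12] = *(volatile unsigned int*)0xE000ED3C;  // AFSR
    while (1);  // park here and inspect fault_snapshot in the debugger
}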
Here is the complete description of how to connect this ISR to the interrupt vector (although, as I understand from your question, you already have it implemented):
Jumping from one firmware to another in MCU internal FLASH
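In short, the usual trick is a small wrapper on the HardFault vector that passes the active stack pointer to the C handler (a sketch in GCC inline-assembly syntax; IAR's inline assembly differs slightly):

__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        "tst   lr, #4              \n"  // EXC_RETURN bit 2: which stack was in use?
        "ite   eq                  \n"
        "mrseq r0, msp             \n"  // fault occurred on the main stack
        "mrsne r0, psp             \n"  // fault occurred on the process stack
        "b     HardFault_Handler_C \n"  // r0 = pointer to the stacked frame
    );
}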
You might be able to find additional information on top of what you already know at:
http://www.keil.com/appnotes/files/apnt209.pdf

Why doesn't RemoveAllCacheResponses completely clear the disk cache?

I have an application that uses the Google Map Javascript API v3 and a UIWebView to display a map onscreen. While on this map I can use the app to collect multiple points of GPS data to represent a line.
After collecting 1460-1480 points the app quits unexpectedly (pinch zooming on the map makes the app quit before the 1400+ threshold is reached). It appears to be a memory issue (my app is the blue wedge of the pie chart).
The map screen does receive multiple memory warnings that are handled in an overridden DidReceiveMemoryWarning method in this screen. There is some prior code that called NSUrlCache.SharedCache.RemoveAllCachedResponses.
public override void DidReceiveMemoryWarning()
{
    // BEFORE
    uint diskUsage = NSUrlCache.SharedCache.CurrentDiskUsage;
    uint memUsage = NSUrlCache.SharedCache.CurrentMemoryUsage;
    int points = _currentEntityManager.GeometryPointCount;
    Console.WriteLine(string.Format("BEFORE - diskUsage = {0}, memUsage = {1}, points = {2}", diskUsage, memUsage, points));

    NSUrlCache.SharedCache.RemoveAllCachedResponses();

    // AFTER
    diskUsage = NSUrlCache.SharedCache.CurrentDiskUsage;
    memUsage = NSUrlCache.SharedCache.CurrentMemoryUsage;
    points = _currentEntityManager.GeometryPointCount;
    Console.WriteLine(string.Format("AFTER - diskUsage = {0}, memUsage = {1}, points = {2}", diskUsage, memUsage, points));

    base.DidReceiveMemoryWarning();
}
I added the BEFORE and AFTER sections so I could track cache contents before and after RemoveAllCachedResponses is called.
The shared cache is configured when the application starts (prior to my working on this issue it was not being configured at all).
uint cacheSizeMemory = 1024 * 1024 * 4;
uint cacheSizeDisk = 1024 * 1024 * 32;
NSUrlCache sharedCache = new NSUrlCache(cacheSizeMemory, cacheSizeDisk, "");
NSUrlCache.SharedCache = sharedCache;
When we're on this screen collecting point data and we receive a low-memory warning, RemoveAllCachedResponses is called and the Before/After statistics are printed to the console. Here are the numbers for the first low-memory warning we receive.
BEFORE - diskUsage = 2258864, memUsage = 605032, points = 1174
AFTER - diskUsage = 1531904, memUsage = 0, points = 1174
Which is what I would expect to happen - flushing the cache reduces disk and memory usage (though I would expect the disk usage number to also go to zero).
All subsequent calls to RemoveAllCachedResponses display these statistics (this Before/After is immediately prior to the app crashing).
BEFORE - diskUsage = 1531904, memUsage = 0, points = 1471
AFTER - diskUsage = 1531904, memUsage = 0, points = 1471
This leads me to believe one of two things - 1. RemoveAllCachedResponses is not working (unlikely) or 2. there's something in the disk cache that can't be removed because it's currently in use, something like the current set of map tiles.
Regarding #2, I'd like to believe this, figuring the reduction in disk usage on the first call represents a set of tiles that were no longer being used because of a pinch zoom in, but no pinch zooming or panning at all was done on this map, i.e. only one set of initial tiles should have been downloaded and cached.
Also, we are loading the Google Map Javascript API file as local HTML, so it could be that this file is what's remaining resident in the cache. But the file is only 18,192 bytes, which doesn't jibe with the 1,531,904 bytes remaining in the disk cache.
I should also mention that the Android version of this app (written with Xamarin.Android) has no such memory issue on its map screen - it is possible to collect 5500+ points without incident.
So why does the disk cache not go to zero when cleared?
Thanks in advance.
I am confused about this too. You can take a look at the question I asked:
removeAllCachedResponses can not clear sharedURLCache?
One answer there said "it has some additional overhead in the cache and it's not related to real network data."

Generating Uniform Double random numbers on device in CUDA

I would like to generate uniform random numbers on the device, to be used inside of a device function. Each thread should generate a different uniform random number. I have this code, but I get a segmentation fault.
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>
#include <curand_mtgp32dc_p_11213.h>

__global__ void doHenry(curandStateMtgp32 *states);  // kernel defined elsewhere

// NUM_BLOCKS and NUM_THREADS are defined elsewhere in my project.
int main()
{
    curandStateMtgp32 *devMTGPStates;
    mtgp32_kernel_params *devKernelParams;
    cudaMalloc((void **)&devMTGPStates, NUM_THREADS * NUM_BLOCKS * sizeof(curandStateMtgp32));
    cudaMalloc((void **)&devKernelParams, sizeof(mtgp32_kernel_params));
    curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);
    curandMakeMTGP32KernelState(devMTGPStates,
        mtgp32dc_params_fast_11213, devKernelParams, NUM_BLOCKS * NUM_THREADS, 1234);
    doHenry<<<NUM_BLOCKS, NUM_THREADS>>>(devMTGPStates);
}
Inside my global function doHenry, evaluated on the device, I put:
double rand1 = curand_uniform_double(&state[threadIdx.x+NUM_THREADS*blockIdx.x]);
Is this the best way to generate a random number per thread? I don't understand what the devKernelParams is doing, but I know I need one state per thread, right?
I think you're getting the seg fault on this line:
curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213, devKernelParams,NUM_BLOCKS*NUM_THREADS, 1234);
I believe the reason for the seg fault is that you have exceeded 200 for the n parameter, for which you are passing NUM_BLOCKS*NUM_THREADS. I tried a version of your code, and I was able to reproduce the seg fault at around n=540.
The MT generator has a limitation on the number of states it can set up when using pre-generated kernel parameters (mtgp32dc_params_fast_11213). You may wish to read the relevant section of the documentation. (Bit Generation with the MTGP32 generator)
I'm not really an expert on CURAND, but other generators (such as XORWOW) don't have this type of limitation, so if you want to generate a large amount of independent thread state easily, consider one of the other generators. Using the particular approach you have outlined, the MTGP32 generator seems to be limited to about 200 blocks * 256 threads of independent generation. Contrary to what I said in the comments (which is true for other generator types), one MTGP32 state seems to be sufficient for a block of up to 256 threads. And the example given in the documentation (refer to the second example) uses that type of state generation and threadblock hierarchy.
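For illustration, a per-thread XORWOW setup avoids the pre-generated-parameter limit (a sketch; the seed and launch dimensions are arbitrary):

#include <curand_kernel.h>

__global__ void setup_states(curandState *states, unsigned long long seed)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // One XORWOW state per thread; use the thread id as the subsequence number
    curand_init(seed, id, 0, &states[id]);
}

__global__ void use_states(curandState *states, double *out)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    curandState local = states[id];           // work on a register copy
    out[id] = curand_uniform_double(&local);  // uniform double in (0, 1]
    states[id] = local;                       // write the advanced state back
}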

Code running perfectly on host, put in a kernel, fails for mysterious reasons

I have to port a pre-existing "host-only" backpropagation implementation to CUDA. I think the nature of the algorithm doesn't matter here, so I won't give much explanation about the way it works. What I think does matter, though, is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated.
I use VS2010, with CUDA 5.0, and my device is compute capability 2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory, using the Data structure, present in “pattern.h”.
several multi-dimensional arrays are allocated
the algorithm is run over the patterns, using the arrays allocated just before.
If you want to try to run the code, don't forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use "2" layers, "5" neurons, and a learning rate of "0.00001". As you can see, this works perfectly. The "MSE" is improving. For those who have no clue what this algorithm does, let's simply say that it learns how to predict a target value based on 14 variables present in the patterns. The "MSE" decreases, meaning that the algorithm makes fewer mistakes after each "epoch".
I spent a really long time trying to run this code on the device, and I'm still unsuccessful. The last attempt was done by simply copying the code initializing the arrays and running the algorithm into one big kernel. Which failed again. This code can be downloaded here
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences with the original host-only code:
f() and fder(), which are used by the algorithm, become device functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
the "w" array is initialized using a fixed value (0.5), not rand() anymore
a Data structure is allocated in device's memory, and the data are sent to device's memory after they have been loaded from adult.data in host's memory
I think I did the minimal amount of modifications needed to make the code run in a kernel. The "kernel_check_learningData" kernel shows some information about the patterns loaded in device's memory, proving that the following code, which sends the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.
I see two possibilities:
the code sending the patterns into device's memory is bugged, despite the fact that it seems to work properly, and provokes a bug much further on, when the forward phase begins.
the CUDA API is not behaving like it should!
I’m desperately searching for my mistake for a very long time. So I wondered if the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64-bit machine mode but not in 32-bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
* for layer = 0
*/
for (i = 0; i < N[0]; i++) { // for all neurons i of layer 0
a[0][i] = x[ data->n * pat + i]; // a[0][i] = input i
}
In 32-bit machine mode (Win32 project, --machine 32 being passed to nvcc), the failure occurs on iteration i=7, when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, on a 64-bit machine the two types of storage are the same size, so it ends up working. But on a 32-bit machine, a double pointer is 4 bytes whereas a double is 8 bytes. So, on a 32-bit machine, when we index through this array taking data strides of 8 bytes, we eventually run off the end of the allocation.
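The fix for layer 0 is to size the allocation by the element type actually stored:

a[0] = (double *)malloc( N[0] * sizeof(double) );  // doubles, not double pointers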
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code too.
You will still need to address the kernel running time in some fashion to avoid the Windows TDR event, if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.