8051 external /WR signal not generated when UART bootloader is used after reset

Goal of the project: to connect an LCD to the 8051 as an external memory-mapped I/O device.
Problem: Given the following details, my 8051 controller just does not generate the external /RD or /WR strobes required for the rest of the code to work.
Previous work: I previously used 8051 port 3 pins to generate the EN, R/W and RS signals and got the LCD to work, so I know my command sequence is fine. However, that was a really inefficient way of driving the LCD, because the enable pulse was generated by setting and resetting a port pin in software. I now wish to connect the LCD using the external /WR and /RD signals, mapping it as a memory-mapped I/O device. I have worked through the timing diagrams, and the overall block diagram is attached here. As the block diagram shows, the LCD is activated using the most significant six pins of port 2, so that it is selected only at the right memory addresses. This operation (implemented in an SPLD) also serves to provide the delay required at the LCD to meet the minimum setup time after port 2 pins 0 and 1 set the R/W and RS inputs of the LCD.
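For readers following along, here is a rough C model of the decode just described (a sketch only, inferred from the address map below rather than from the actual SPLD equations): the LCD appears to be selected when A15..A10 equal 101010b, with A8 and A9 (port 2 pins 0 and 1) supplying the R/W and RS levels.

#include <stdint.h>

/* LCD is selected for addresses 0xA800-0xABFF (A15..A10 == 101010b). */
static int lcd_selected(uint16_t addr)
{
    return (addr >> 10) == 0x2A;
}

static int lcd_rw(uint16_t addr) { return (addr >> 8) & 1; } /* A8 -> R/W */
static int lcd_rs(uint16_t addr) { return (addr >> 9) & 1; } /* A9 -> RS  */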
Additional hardware info: I have attached a SPICE schematic to show how the rest of my 8051 is connected. The one thing not included there is this: "I use a momentary pushbutton and pull-down resistor for /PSEN, and hold that button when coming out of reset in order to force bootloader operation; then, after the bootloader has started, I release that button to eliminate drive-fight issues on the /PSEN line. I use a header/jumper for the /EA input to ensure it is high. Note that if you use these hardware conditions to enter the bootloader when you come out of reset, then the Atmel bootloader is entered regardless of the values of BLJB, BSB, and SBV."
Software used: I am using paulmon2 to test my code. Programming is done using the Flip utility (Flip 3.4.7) through the serial port. A serial emulator program (TeraTerm) is used to communicate with the microcontroller. The microcontroller first executes the paulmon code, along with the extra commands that have been programmed into it, before reaching the current user code at location 0x2000. An extra command allows the user to jump to this code using the 'J' command and then giving the address 0x2000. This calls the current program and executes it; this is where my code resides and executes from.
The addresses used to map the LCD are the following:
LCD_INSTR_WR: 0xA8FF ---> Used to write commands to the LCD controller. This includes all initialization, setup and management commands.
LCD_INSTR_RD: 0xA9FF ---> Used to read a command; done only to read the busy flag or the current address counter. This is valid only for a single instruction on the LCD.
LCD_DATA_WR: 0xAAFF ---> Used to write data to the current address, which has been set in either DDRAM or CGRAM by LCD_INSTR_WR above.
LCD_DATA_RD: 0xABFF ---> Used to read data from the current address, which has been set in either DDRAM or CGRAM by LCD_INSTR_WR above.
(Note that all four addresses share the same upper six address bits, while A8 and A9, driven by port 2 pins 0 and 1, encode the R/W and RS states.)
The C code snippet I use to write to external memory:
//Global variables
volatile unsigned char xdata *LCD_INSTR_WR = (unsigned char xdata *) 0xA8FF;
volatile unsigned char xdata *LCD_INSTR_RD = (unsigned char xdata *) 0xA9FF;
volatile unsigned char xdata *LCD_DATA_WR = (unsigned char xdata *) 0xAAFF;
volatile unsigned char xdata *LCD_DATA_RD = (unsigned char xdata *) 0xABFF;
/// More code
//Write command example
lcdbusywait();      // poll the busy flag before each access
*LCD_DATA_WR = cc;  // MOVX write; should pulse the external /WR
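For reference, a minimal sketch of what lcdbusywait() could look like under this mapping (an assumption based on HD44780-class controllers, where a status read returns the busy flag in bit 7; this is not the poster's actual code):

void lcdbusywait(void)
{
    /* A read from LCD_INSTR_RD performs a MOVX read with RS = 0 and
       R/W = 1; bit 7 of the returned status byte is the busy flag. */
    while (*LCD_INSTR_RD & 0x80)
        ;  /* spin until the controller is ready */
}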
Earlier tests done to figure out the problem:
I have tried writing to memory locations above 0x2000 using the paulmon memory-edit commands, and they write those locations fine. Even the /WR signal is generated in this case, as observed (though I have not properly measured/counted the accesses and /WR edge changes).
I have used a logic analyser to confirm that the address (and consequently RS and R/W) and data (the 0x30 command at the beginning) arrive at the ports as expected. ALE is being generated.
I have verified that the AUXR register's EXTRAM bit is set (AUXR = 0x0E). Also, since EXTRAM is set by default, I tried removing my AUXR initialization code completely, and that didn't work either.
I was not sure the C code I had written for the XRAM accesses was correct, so I checked the generated .asm file, and (unless I am overlooking something very minute) the assembly assigns the value 0x30 as immediate data to the accumulator and uses a "MOVX @DPTR,A" instruction to write it to external memory.
Finally, this is my first post on Stack Overflow, so the formatting may be off, and I realize this is an extremely long post; apologies for that. Let me know if you need to see the code files, the compiled hex file, or other details. All your help is deeply appreciated.

volatile unsigned char xdata *LCD_INSTR_WR = (char xdata *) 0xA8FF;
I guess LCD_INSTR_WR should hold the address 0xA8FF as its value.
You can try using
#define LCD_INSTR_WR XBYTE[0xA8FF]
and then
LCD_INSTR_WR = cc;
or
char xdata LCD_INSTR_WR _at_ 0xA8FF; //This declares LCD_INSTR_WR at location 0xA8FF
and then, likewise,
LCD_INSTR_WR = cc;
You may also need to look into the microcontroller's data sheet for how to configure external memory.
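Putting the two suggestions together, a minimal sketch using Keil C51's absacc.h (a sketch only: the XBYTE macro and _at_ keyword are Keil-specific, the addresses come from the question, and the busy-flag-in-bit-7 detail assumes an HD44780-class controller):

#include <absacc.h>                /* Keil C51: defines XBYTE[] for xdata access */

#define LCD_INSTR_WR XBYTE[0xA8FF] /* write command                    */
#define LCD_INSTR_RD XBYTE[0xA9FF] /* read busy flag / address counter */
#define LCD_DATA_WR  XBYTE[0xAAFF] /* write data                       */
#define LCD_DATA_RD  XBYTE[0xABFF] /* read data                        */

void lcd_write_data(unsigned char cc)
{
    while (LCD_INSTR_RD & 0x80)    /* wait while busy flag (bit 7) is set   */
        ;
    LCD_DATA_WR = cc;              /* compiles to MOVX @DPTR,A; pulses /WR  */
}

Each XBYTE access compiles to a MOVX through DPTR, which is the bus cycle that generates the external /RD and /WR strobes.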

Related

How to declare a struct with a dynamically allocated array inside it in device code

How do I declare a struct in device code where one of its members is an array that is dynamically allocated? For example, in the code below the compiler says: error : calling a __host__ function("malloc") from a __global__ function("kernel_ScoreMatrix") is not allowed. Is there another way to perform this action?
The type of dev_size_idx_threads is int*; its value is sent to the kernel and used to allocate memory.
struct struct_matrix
{
    int *idx_threads_x;
    int *idx_threads_y;
    int thread_diag_length;
    int idx_length;
};

struct struct_matrix matrix[BLOCK_SIZE_Y];
matrix->idx_threads_x = (int *) malloc((*dev_size_idx_threads) * sizeof(int));
From device code, dynamic memory allocation (malloc and new) is supported only on devices of cc2.0 and greater. If you have a cc2.0 device or greater, and you pass an appropriate flag to nvcc (such as -arch=sm_20), you should not see this error. Note that if you are passing multiple compilation targets (sm_10, sm_20, etc.), then if even one of the targets does not meet the cc2.0+ requirement, you will see this error.
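For instance, a single-target compile command might look like this (the file and output names here are hypothetical):

nvcc -arch=sm_20 mycode.cu -o mycode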
If you have a cc1.x device, you will need to perform these types of allocations from the host (e.g. using cudaMalloc) and pass appropriate pointers to your kernel.
If you choose that route (allocating from the host), you may also be interested in my answer to questions like this one.
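A minimal sketch of that host-side route for the struct in the question (the element count and launch configuration are hypothetical, and error checking is omitted for brevity):

struct struct_matrix h_matrix;      // staged on the host
struct struct_matrix *d_matrix;     // device copy of the struct
int *d_idx_x;                       // device storage for the array member
int n = 128;                        // element count, known on the host

cudaMalloc((void **)&d_idx_x, n * sizeof(int));
cudaMalloc((void **)&d_matrix, sizeof(struct struct_matrix));

h_matrix.idx_threads_x = d_idx_x;   // member now points at device memory
h_matrix.idx_length = n;
cudaMemcpy(d_matrix, &h_matrix, sizeof(struct struct_matrix),
           cudaMemcpyHostToDevice);

// the kernel receives a pointer to a fully populated device-side struct
kernel_ScoreMatrix<<<1, BLOCK_SIZE_Y>>>(d_matrix);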
EDIT: responding to questions below:
In Visual Studio (2008 Express; it should be similar for other versions), you can set the compilation target as follows: open the project, select Project...Properties, then Configuration Properties...CUDA Runtime API...GPU. Now, on the right-hand pane, you will see entries like GPU Architecture (1) (and (2), etc.). These are drop-downs you can click on to select the target(s) you want to compile for. If your GPU is sm_21, I would select that for (1) and leave the others blank, or select compatible versions like sm_20.
To see worked examples, please follow the link I gave above. A couple of worked examples are linked from my answer here, as well as a description of how it is done.

Error in assigning values to bytes in a 2D array of registers in Verilog

Hi, when I write this piece of code:
module memo(out1);
    reg [3:0] mem [2:0];
    output wire [3:0] out1;

    initial
    begin
        mem[0][3:0] = 4'b0000;
        mem[1][3:0] = 4'b1000;
        mem[2][3:0] = 4'b1010;
    end

    assign out1 = mem[1];
endmodule
I get the following warnings, which make the code unsynthesizable:
WARNING:Xst:1780 - Signal mem<2> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
WARNING:Xst:653 - Signal mem<1> is used but never assigned. This sourceless signal will be automatically connected to value 1000.
WARNING:Xst:1780 - Signal mem<0> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
Why am I getting these warnings?
Haven't I assigned the values of mem[0], mem[1] and mem[2]?! Thanks for your help!
Your module has no inputs and a single output -- out1. I'm not totally sure what the point of the module is with respect to your larger system, but you're basically initializing mem and then only using mem[1]. You could equivalently have a module which just assigns out1 the value 4'b1000 (mem never changes). So yes -- you did initialize the array, but because you didn't use any of the other values, the Xilinx tools are optimizing your module during synthesis and "trimming the fat." If you were to simulate this module (say, in ModelSim) you'd see your initializations just fine. Based on your warnings, though, I'm not sure why you've come to the conclusion that your code is unsynthesizable. It appears to me that you could definitely synthesize it, but that it's just a rather roundabout way to assign the single value 4'b1000.
With regard to using initial blocks to store values in block RAM (e.g. to make a ROM), that's fine; I've done it several times without issue. A common use for this is to store coefficients in block RAM, which are read out later. That stated, the way this module is written there's no way to read anything out of mem anyway.

Code running perfectly on host, put in a kernel, fails for mysterious reasons

I have to port a pre-existing "host-only" backpropagation implementation to CUDA. I think the nature of the algorithm doesn't matter here, so I won't explain much about the way it works. What I think does matter, though, is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated.
I use VS2010 with CUDA 5.0, and my device is a cc2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory, using the Data structure present in "pattern.h".
several multi-dimensional arrays are allocated
the algorithm is run over the patterns, using the arrays allocated just before.
If you want to try to run the code, don't forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use "2" layers, "5" neurons, and a learning rate of "0.00001". As you can see, this works perfectly: the "MSE" is improving. For those who have no clue what this algorithm does, let's simply say that it learns how to predict a target value based on the 14 variables present in the patterns. The "MSE" decreases, meaning that the algorithm makes fewer mistakes after each "epoch".
I have spent a really long time trying to run this code on the device, and I'm still unsuccessful. My last attempt was done by simply copying the code that initializes the arrays and runs the algorithm into one big kernel, which failed again. This code can be downloaded there
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences from the original host-only code:
f() and fder(), which are used by the algorithm, become __device__ functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
the "w" array is initialized using a fixed value (0.5), not rand() anymore
a Data structure is allocated in device memory, and the data are sent to device memory after they have been loaded from adult.data into host memory
I think I did the minimal amount of modification needed to make the code run in a kernel. The "kernel_check_learningData" kernel shows some information about the patterns loaded in device memory, proving that the following code, which sends the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers in the device's Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the "w" array. I can't find any explanation for that.
I see two possibilities:
the code sending the patterns into the device's memory is bugged, despite the fact that it seems to work properly, and provokes a bug much further on, at the beginning of the forward phase;
or the CUDA API is not behaving like it should!
I have been desperately searching for my mistake for a very long time, so I wondered if the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64 bit machine mode but not 32 bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
* for layer = 0
*/
for (i = 0; i < N[0]; i++) { // for all neurons i of layer 0
a[0][i] = x[ data->n * pat + i]; // a[0][i] = input i
}
In 32 bit machine mode (Win32 project, --machine 32 is being passed to nvcc), the failure occurs on the iteration i=7 when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, in a 64 bit machine the two types of storage are the same size, so it ends up working. But in a 32-bit machine, a double pointer is 4 bytes whereas double data is 8 bytes. So, in a 32-bit machine, when we index through this array taking data strides of 8 bytes, we eventually run off the end of the array.
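To make the arithmetic concrete (assuming N[0] is the 14 input variables mentioned in the question): on a 32-bit build, malloc(N[0] * sizeof(double *)) reserves 14 * 4 = 56 bytes, and the write to a[0][7] begins at byte offset 7 * 8 = 56, exactly one element past the end of the allocation. The corrected line allocates element storage instead:

a[0] = (double *)malloc( N[0] * sizeof(double) );   /* 14 * 8 = 112 bytes */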
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code, too.
You will still need to address the kernel running time in some fashion, to avoid the Windows TDR event, if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.

CUDA, low performance in storing data in shared memory

Here is my problem: in order to speed up my project, I want to save a value generated inside a kernel into shared memory. However, I found that it takes a very long time to save that value. If I remove "THIS LINE" (see the code below), it is very fast to save that value (a 100x speed-up!).
extern __shared__ int sh_try[];

__global__ void xxxKernel (...)
{
    float v, e0, e1;
    float t;
    int count(0);
    for (...)
    {
        v  = fetchTexture();
        e0 = fetchTexture();
        e1 = fetchTexture();
        t  = someDeviceFunction(v, e0, e1);
        if (t > 0.0 && t < 1.0)    // <========== <THIS LINE>
            count++;
    }
    sh_try[threadIdx.x] = count;
}

main()
{
    sth..
    START TIMING:
    xxxKernel<<<gridDim.x, BlockDim.x, BlockDim.x*sizeof(int)>>> (...);
    cudaDeviceSynchronize();
    END TIMING.
    sth...
}
To figure out this problem, I simplified my code so that it just saves the data into shared memory and stops. As far as I know, shared memory is the most efficient memory besides registers, so I wonder if this high latency is normal or whether I've done something wrong. PLEASE give me some advice!!! Thank you guys in advance!!!
trudi
UPDATE:
If I replace the shared memory with global memory, it takes almost the same time: 33 ms without "THIS LINE", 297 ms with it. Is it normal for saving data to global memory to take the same time as to shared memory? Is that also a part of 'compiler optimization'?
I have checked other similar problems on Stack Overflow, i.e. cases where there is a huge time gap between saving data into shared memory or not, which may be caused by compiler optimization: since it is pointless to compute data but not save them, the compiler simply 'removes' the pointless code.
I am not sure if I share the same cause, since the line that changes the game, "THIS LINE", supports that hypothesis: when I comment it out, the variable 'count' increases in EVERY iteration; when I uncomment it, it increases only when t is meaningful.
Any ideas? Please...
Frequently, when large performance changes are seen as a result of relatively small code changes (such as adding or deleting a line of code in a kernel), the performance changes are not due to the actual performance impact of that line of code, but are due to the compiler making different optimization decisions, which can result in wholesale additions or deletions of machine code in your kernels.
A relatively easy way to help confirm this is to look at the generated machine code. For example, if the size of the generated machine code changes substantially due to the addition or deletion of a single line of source code, it may be the case that the compiler made an optimization decision that drastically affected the code.
Although it's not machine code, for these purposes a reasonable proxy is to look at the generated PTX code, which is an intermediate code that the compiler creates.
You can generate PTX by simply adding the -ptx switch to your compile command:
nvcc -ptx mycode.cu
This will generate a file called mycode.ptx, which you can inspect. Naturally, if your regular compile command requires extra switches (e.g. -I/path/to/include/files), then this command may require those same switches. The nvcc manual provides more information on code-generation options, and there is a PTX manual to help you learn about PTX, but you may be able to get a rough idea just based on the size of the generated PTX (e.g. the number of lines in the .ptx file).
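As an illustration of the effect being described (a hedged sketch, not the poster's code): the two kernels below differ only in whether count is stored. Compiling the file with nvcc -ptx and comparing the PTX emitted for each kernel would be expected to show the compiler deleting the entire loop from the second one, since its result is never used.

__global__ void kept(const float *in, int *out, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (in[i] > 0.0f && in[i] < 1.0f)
            count++;
    out[threadIdx.x] = count;   // result is stored: the loop must survive
}

__global__ void discarded(const float *in, int *out, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (in[i] > 0.0f && in[i] < 1.0f)
            count++;
    (void)out;                  // count is never stored, so the compiler
                                // is free to remove the loop entirely
}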

VHDL MIPS 5-stage pipeline bug

The code for this is too long to post, so I'll just describe it. I've created a 5-stage MIPS pipeline that almost works. The catch is that EVERY lw instruction that reaches the instruction decode stage overwrites the control signal values in the execution stage. Not only that, it causes the PC to skip an instruction, i.e. from 300 -> 308. I just need some ideas on where to look for bugs, since this is a class assignment. If we take out all the lw instructions, the CPU works fine.
Example:
The adder in the EX stage is going to compute sub $4, $1, $2, which should be 1.
Once lw enters the ID stage, ALUsrc is asserted AND ALUop is changed from subtract to add.
This forces the adder in the EX stage to compute add $4, $1, $2, resulting in 5 being stored in $4.
http://en.wikipedia.org/wiki/File:MIPS_Architecture_%28Pipelined%29.svg
The MIPS 5-stage pipeline (annotated to show the write-register select and enable)
The bottom line through the pipeline stages represents the register file's write (back) port address and write enable; WB is the data from memory.
http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html
Load Word Instruction
Description:
A word is loaded into a register from the specified address.
Operation: $t = MEM[$s + offset]; advance_pc (4);
Syntax: lw $t, offset($s)
Encoding:
1000 11ss ssst tttt iiii iiii iiii iiii
The data written to register $t is read from the data-memory address formed from register-file output $s offset by the immediate value i, which is sign-extended. Your $4 is $t above, $1 or $2 is $s, while the remaining register-file output lane sounds like it has been suborned for the sign-extended immediate.
From your description, it sounds like you aren't using a three-port register file with one port as a write-only port.
With a three-port register file, the only time you run into conflicts is when you attempt to use the new register value from memory before it has been read from memory and written to the register file. That can be managed by a compiler scheduling NOOPs until the outstanding register-file write is retired when a following instruction tries to use it, or by stalling IF/ID in hardware when its output contains a reference to an outstanding register-file write.
There are three instructions that can be in flight to the right of IF/ID, each with a register-file write address and a write enable. You'd need to compare both instruction-decode register-file addresses against all three of those and stall IF/ID until they clear out. The write enable stored in each of those three pipeline stages determines whether the write-register address in that stage should be compared.
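A rough C model of that comparison may make it concrete (a sketch only; the stage and field names here are hypothetical, not taken from the poster's VHDL):

#include <stdbool.h>

struct stage {        /* simplified per-stage pipeline latch */
    int  wreg;        /* register-file write address         */
    bool wen;         /* register-file write enable          */
};

/* Stall IF/ID if either source register decoded in ID matches an
   outstanding register-file write in ID/EX, EX/MEM or MEM/WB. */
bool must_stall(int rs, int rt,
                struct stage id_ex, struct stage ex_mem, struct stage mem_wb)
{
    struct stage inflight[3] = { id_ex, ex_mem, mem_wb };
    for (int i = 0; i < 3; i++)
        if (inflight[i].wen && inflight[i].wreg != 0 &&  /* writes to $0 are ignored */
            (inflight[i].wreg == rs || inflight[i].wreg == rt))
            return true;
    return false;
}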
Because the ID/EX, EX/MEM and MEM/WB write-register addresses are not used anywhere else, the circuitry for doing the comparison can be collocated with IF/ID and the register file, preventing unnecessary layout delays from affecting the minimum clock cycle.
Using a two-port register file is much simpler but implies stalling IF/ID until the write enable comes back from MEM/WB, effectively turning any memory-reading instruction into a 3-cycle instruction (or more; data memory can stall if it's a cache or slow). That makes a three-port register file more or less necessary for performance reasons. There's an implied multiplexer to source at least one of the two register-file port controls (write enable, write address) from the MEM/WB stage when IF/ID is stalled (for memory->regfile).
Data-memory access can stall MEM/WB, just as instruction-memory access can stall IF/ID. A stalled IF/ID doesn't issue a write enable for the register file to ID/EX, nor does a stalled MEM/WB.