Unable to generate line numbers in fpc debugging output - freepascal

I have a very simple test program (test.pas) as shown below and I'm trying to generate a memory trace but am unable to get any detailed output containing line numbers etc.
program test;
var
  intPointer: ^integer;
begin
  new(intPointer); // Allocate some memory
  intPointer^ := 5;
  // dispose(intPointer);
  WriteLn('Hello World');
end.
I ran the following.
fpc -g -gh -gl test.pas; ./test
And this is the output I get.
Hello World
Heap dump by heaptrc unit
1 memory blocks allocated : 2/8
0 memory blocks freed : 0/0
1 unfreed memory blocks : 2
True heap size : 327680 (32 used in System startup)
True free heap : 327488
Should be : 327512
Call trace for block $00000001000CA0C0 size 2
In this toy example, I can tell that intPointer was not disposed of, but for larger applications I was hoping for more insight. Other examples online seem to show the line number in the original file that allocated the memory, and I was wondering what I am doing incorrectly.
Any suggestions?
Edit:
Added another example (subsection 9.2 of the tutorial linked below) that I'm unable to get line number information for.
http://www.math.uni-leipzig.de/pool/tuts/FreePascal/units/node10.html

Sometimes lineinfo can't determine the source file exactly, especially for the top call in the stack trace. If you would like to see the file name, you need to move your code from the main begin/end block into a procedure. It's not an ideal solution, but it makes debugging a bit easier.
program test;

procedure PointerTest;
var
  intPointer: ^integer;
begin
  new(intPointer); // Allocate some memory
  intPointer^ := 5;
  // dispose(intPointer);
  WriteLn('Hello World');
end;

begin
  PointerTest;
end.
Also, it seems that the exeinfo unit (which lineinfo uses) currently supports only the ppc32 architecture on macOS.
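A side note that may help with the larger applications mentioned in the question: heaptrc can also write its report to a file via SetHeapTraceOutput, which keeps long traces out of the console. A minimal sketch, still compiled with fpc -g -gh -gl:
program test2;

uses
  heaptrc; // needed to reference SetHeapTraceOutput; -gh links the unit in anyway

procedure PointerTest;
var
  intPointer: ^integer;
begin
  new(intPointer);        // allocate some memory
  intPointer^ := 5;
  // dispose(intPointer); // intentionally leaked, as in the question
  WriteLn('Hello World');
end;

begin
  SetHeapTraceOutput('heap.trc'); // write the heaptrc report to heap.trc
  PointerTest;
end.
After the program exits, heap.trc should contain the same dump, including the call trace (with line numbers when -gl is used and the allocation happens inside a procedure).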

Related

8051 External /WR signal not generated when UART bootloader is used after reset

Goal of the project: To connect an LCD with 8051 as an external memory-mapped I/O device
Problem: Given the following details, my 8051 controller just does not generate an external RD/WR command as required for the rest of the code to work.
Previous work: I used 8051 port 3 pins to generate the EN, R/W and RS signals and got it to work. Therefore, I know that my command sequence is working fine. However, this was a really inefficient way of using the LCD because the enable pulse was generated by setting and resetting a port pin. I wish to connect the LCD using the external WR/RD signals and map it as a memory-mapped I/O device. I have worked through the timing diagrams and the overall block diagram is attached here. As you can see (in the block diagram), the R/W line of the LCD is activated using the most significant 6 pins of port 2 so that the LCD gets activated only at the right memory addresses. This operation (implemented in an SPLD) also provides the delay required to meet the LCD's minimum setup time after Port 2 pins 0 and 1 are used to set the R/W and RS inputs of the LCD.
Additional hardware info: I have attached a spice diagram to show how the rest of my 8051 is connected. The one thing that is not included there is this: "I use a momentary pushbutton and pull-down resistor for /PSEN, and hold that button when coming out of reset in order to force bootloader operation; then, after the bootloader has started, I release that button to eliminate drive fight issues on the /PSEN line. I use a header/jumper for the /EA input to ensure it is high. Note that if you use these hardware conditions to enter the bootloader when you come out of reset, then the Atmel bootloader is entered regardless of the values of BLJB, BSB, and SBV."
Software used: I am using paulmon2 to test my code. Programming is done using the Flip utility (Flip 3.4.7) through the serial port. A terminal emulator program (TeraTerm) is used to communicate with the microcontroller. The microcontroller first executes the paulmon code, as well as the extra commands that have been programmed into it, before the current user code at location 0x2000. An extra command allows the user to jump to this code using the 'J' command and then giving the memory address 0x2000. This calls the current program and executes it. This is where my code resides and executes from.
The addresses used to map the LCD are the following:
LCD_INSTR_WR: 0xA8FF ---> Used to write commands to the LCD controller. This includes all initialization, setup and management commands.
LCD_INSTR_RD: 0xA9FF ---> Used to read a command; done only to read the busy flag or the current address counter. This is valid only for a single instruction on the LCD.
LCD_DATA_WR: 0xAAFF ---> Used to write data to the current address, which has been set in either DDRAM or CGRAM via LCD_INSTR_WR above.
LCD_DATA_RD: 0xABFF ---> Used to read data from the current address, which has been set in either DDRAM or CGRAM via LCD_INSTR_WR above.
The C code snippet I use to write to the external memory:
//Global variables
volatile unsigned char xdata *LCD_INSTR_WR = (char xdata *) 0xA8FF;
volatile unsigned char xdata *LCD_INSTR_RD = (char xdata *) 0xA9FF;
volatile unsigned char xdata *LCD_DATA_WR = (char xdata *) 0xAAFF;
volatile unsigned char xdata *LCD_DATA_RD = (char xdata *) 0xABFF;
/// More code
//Write command example
lcdbusywait();
*LCD_DATA_WR = cc;
Earlier tests done to figure out the problem:
I have tried writing to the memory locations above 0x2000 using the paulmon memory-edit instructions, and they write those memory locations alright. The /WR signal is also generated in this case, as observed (but I have not properly measured/counted the accesses and /WR edge changes).
I have used the logic analyser to confirm that the address (and consequently RS and RW) and data (0x30H command in the beginning) are coming to the ports as expected. ALE is being generated.
I have verified that AUXR register bit EXTRAM is set (AUXR = 0x0E). Also, since EXTRAM is set by default, I tried to remove my initialization code for AUXR completely and that didn’t work either.
I was not sure that the C code that I have written for the XRAM address accesses is correct. However, I went on to check the .asm file and (unless I am neglecting something very minute), the generated assembly code does assign the value 0x30 as immediate data to the accumulator A and uses a "MOVX @DPTR,A" instruction to write this value to external memory.
Finally, this is my first post at Stack Overflow, so the formatting may be off, and I do realize this is an extremely long post. Apologies for that. Let me know if you need to see the code files or the compiled hex file or other details. All your help is deeply appreciated.
volatile unsigned char xdata *LCD_INSTR_WR = (char xdata *) 0xA8FF;
I guess you want LCD_INSTR_WR to refer to address 0xA8FF.
You can try using
#include <absacc.h>  /* provides the XBYTE macro in Keil C51 */
#define LCD_INSTR_WR XBYTE[0xA8FF]
and then
LCD_INSTR_WR = cc;
or
char xdata LCD_INSTR_WR _at_ 0xA8FF; // This declares LCD_INSTR_WR at location 0xA8FF
and then
LCD_INSTR_WR = cc;
You may also need to look into the data sheet of the microcontroller to see how to configure external memory access.

Error in assigning values to bytes in a 2D array of registers in Verilog

Hi, when I write this piece of code:
module memo(out1);
  reg [3:0] mem [2:0];
  output wire [3:0] out1;
  initial
  begin
    mem[0][3:0] = 4'b0000;
    mem[1][3:0] = 4'b1000;
    mem[2][3:0] = 4'b1010;
  end
  assign out1 = mem[1];
endmodule
I get the following warnings, which make the code unsynthesizable:
WARNING:Xst:1780 - Signal mem<2> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
WARNING:Xst:653 - Signal mem<1> is used but never assigned. This sourceless signal will be automatically connected to value 1000.
WARNING:Xst:1780 - Signal mem<0> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
Why am I getting these warnings?
Haven't I assigned the values of mem[0], mem[1] and mem[2]? Thanks for your help!
Your module has no inputs and a single output, out1. I'm not totally sure what the point of the module is with respect to your larger system, but you're basically initializing mem and then only using mem[1]. You could equivalently have a module which just assigns out1 the value 4'b1000 (mem never changes). So yes, you did initialize the array, but because you didn't use any of the other values, the Xilinx tools are optimizing your module during synthesis and "trimming the fat." If you were to simulate this module (say, in ModelSim) you'd see your initializations just fine. Based on your warnings, though, I'm not sure why you've come to the conclusion that your code is unsynthesizable. It appears to me that you could definitely synthesize it, but that it's just sort of a weird way to assign a single output the value 4'b1000.
With regards to using initial blocks to store values in block RAM (e.g. to make a ROM), that's fine. I've done that several times without issue. A common use for this is to store coefficients in block RAM, which are read out later. That said, the way this module is written there's no way to read anything out of mem anyway.

CUDA, low performance in storing data in shared memory

Here is my problem: in order to speed up my project, I want to save a value generated inside the kernel into shared memory. However, I found it takes a very long time to save that value. If I remove "THIS LINE" (see the code below), saving that value becomes very fast (a 100x speed-up!).
extern __shared__ int sh_try[];

__global__ void xxxKernel (...)
{
    float v, e0, e1;
    float t;
    int count(0);
    for (...)
    {
        v  = fetchTexture();
        e0 = fetchTexture();
        e1 = fetchTexture();
        t  = someDeviceFunction(v, e0, e1);
        if (t > 0.0 && t < 1.0)    // <========== <THIS LINE>
            count++;
    }
    sh_try[threadIdx.x] = count;
}
main()
{
    sth..
    START TIMING:
    xxxKernel<<<gridDim.x, BlockDim.x, BlockDim.x*sizeof(int)>>> (...);
    cudaDeviceSynchronize();
    END TIMING.
    sth...
}
In order to figure out this problem, I simplified my code so that it just saves the data into shared memory and stops. Since, as far as I know, shared memory is the most efficient memory besides registers, I wonder if this high latency is normal or if I've done something wrong. Please give me some advice! Thank you guys in advance!
trudi
UPDATE:
If I replace shared memory with global memory, it takes almost the same time: 33 ms without "THIS LINE", 297 ms with it. Is it normal that saving data to global memory takes the same time as to shared memory? Is that also a part of 'compiler optimization'?
I have also checked other similar problems on Stack Overflow, i.e., cases where there is a huge time gap between saving data into shared memory or not, which may be caused by compiler optimization: since it is pointless to calculate data but not save it, the compiler just 'removes' that pointless code.
I am not sure whether my case has the same cause, since the line that changes the game, "THIS LINE", is a conditional test: when I comment it out, the variable 'count' increases in EVERY iteration; when I uncomment it, it increases only when t is meaningful.
Any ideas? Please...
Frequently, when large performance changes are seen as a result of relatively small code changes (such as adding or deleting a line of code in a kernel), the performance changes are not due to the actual performance impact of that line of code, but are due to the compiler making different optimization decisions, which can result in wholesale additions or deletions of machine code in your kernels.
A relatively easy way to help confirm this is to look at the generated machine code. For example, if the size of the generated machine code changes substantially due to the addition or deletion of a single line of source code, it may be the case that the compiler made an optimization decision that drastically affected the code.
Although it's not machine code, for these purposes a reasonable proxy is to look at the generated PTX code, which is an intermediate code that the compiler creates.
You can generate PTX by simply adding the -ptx switch to your compile command:
nvcc -ptx mycode.cu
This will generate a file called mycode.ptx which you can inspect. Naturally, if your regular compile command requires extra switches (e.g. -I/path/to/include/files), then this command may require those same switches. The nvcc manual provides more information on code-generation options, and there is a PTX manual to help you learn about PTX, but you may be able to get a rough idea just based on the size of the generated PTX (e.g. the number of lines in the .ptx file).
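If you just want a rough before/after comparison, counting the lines of the two PTX files is often enough (Unix-style shell assumed; the output file names are just examples):
nvcc -ptx mycode.cu -o with_line.ptx
# comment out "THIS LINE" in the kernel, then rebuild the PTX:
nvcc -ptx mycode.cu -o without_line.ptx
wc -l with_line.ptx without_line.ptx
A large difference in line count is a strong hint that the compiler removed (or added) a substantial amount of code, rather than just that one comparison.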

CUDA samples matrixMul error

I am very new to CUDA and started reading about parallel programming and CUDA just a few weeks ago. After I installed the CUDA toolkit, I was browsing the SDK samples (which come with the installation of the toolkit) and wanted to try some of them out. I started with matrixMul from the 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try a bigger one (for example 960x960 or 1024x1024). In this case, something crashes (I get a black screen, and then the message: display driver stopped responding and has recovered).
I am changing these two lines in the code (in the main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point out what I am doing wrong, and should I alter something else in this example for it to work properly? Thanks!
Edit: as some of you suggested, I changed the timeout value (0 somehow did not work for me, so I set the timeout to 60), so my driver no longer crashes, but I get a huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this have something to do with the allocation of the memory? Should I make changes there, and what could they be?
Your new problem is actually just the strict tolerances provided in the NVidia example. Your kernel is running correctly. It's just complaining that the accumulated error is greater than the limit that they set for this example. This is just because you're doing a lot more math operations, which are all accumulating error. If you look at the numbers it's giving you, you're only off from the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math. The reason you're getting these errors now and not with the default matrix sizes is that the original matrices were smaller and thus required a lot fewer operations to multiply. Matrix multiplication of N x N matrices requires on the order of N^3 operations, so the number of operations required increases much faster than the size of the matrix, and the accumulated error increases in proportion to the number of operations.
If you look near the end of the runTest() function, there's a call to computeGold(), which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results. The last parameter to this is a tolerance. If you increase this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5), you should eliminate these error messages. Note that there may be a couple of these calls. The version of the SDK examples that I have has an optional CUBLAS implementation, so it has a comparison for that against the gold result, too. The one right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
I'd advise looking at the indexing used in the kernel (matrixMulCUDA) a bit closer - it sounds like you're writing to unallocated memory.
More specifically, is the only thing that you changed the dimsA and dimsB variables? Inside the kernel they use the thread and block index to access the data - did you also increase the data size accordingly? There is no bounds checking going on in the kernel, so if you just change the kernel launch configuration, but not the data, then odds are you're writing past your data into some other memory
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine, but that the larger matrices caused the kernel execution to exceed Windows' timeout, which causes Windows to assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable that before doing any serious CUDA work in Windows. The timeout is quite short by default, since normal graphics rendering should take small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting
HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.
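If you prefer the command line to regedit, the same settings can be applied from an elevated command prompt with the standard reg.exe syntax (reboot afterwards, as noted above):
reg add "HKLM\System\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 0 /f
or, to raise the timeout instead of disabling detection (60 seconds matches the value used in the question's edit):
reg add "HKLM\System\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f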

How to pass a larger buffer size to DCPCrypt 'UpdateStream' Procedure

I have a program that currently hashes files using just SHA1. No other options. It hashes them using the SHA1 hash function that's part of the Lazarus and Free Pascal Compiler.
I've since added the ability to use MD5, SHA256 and SHA512 by using the DCPCrypt library (http://wiki.lazarus.freepascal.org/DCPcrypt or http://www.cityinthesky.co.uk/opensource). Everything is working fine; however, my earlier version hashed the file in 2 MB buffers if the file was larger than 1 MB. If it was smaller than 1 MB, it used the default buffer of 1024 bytes, like this:
if SizeOfFile > 1048576 then // if > 1 MB
begin
  fileHashValue := SHA1Print(SHA1File(NameOfFileToHash, 2097152)); // 2 MB buffer
end
else
  fileHashValue := SHA1Print(SHA1File(NameOfFileToHash)); // 1024 byte buffer
However, my hashing functions and procedures have now been moved to a single function controlled by a radio button status, to make my code more object oriented. It basically has all 4 hashing options coded within it, and which section is run depends on which RadioButton.Checked status the program finds. The code for SHA1, for example, now looks like this:
..
SourceData := TFileStream.Create(FileToBeHashed, fmOpenRead);
..
else if SHA1RadioButton2.Checked = true then
begin
  varSHA1Hash := TDCP_SHA1.Create(nil);
  varSHA1Hash.Init;
  varSHA1Hash.UpdateStream(SourceData, SourceData.Size); // HOW DO I ADD A BUFFER HERE?
  varSHA1Hash.Final(DigestSHA1);
  varSHA1Hash.Free;
  for i := 0 to 19 do // 40 character output
    GeneratedHash := GeneratedHash + IntToHex(DigestSHA1[i], 2);
end // End of SHA1 if
My question is: how do I add a buffer size to varSHA1Hash.UpdateStream if the file found is 'large' (say, bigger than 1 MB)? This is important because a 300 MB file, for example, takes 4 seconds with my earlier version and now takes 9 seconds with my 'improved' version that utilises the DCPCrypt library! So it has doubled the time it takes for large files, even though my code reads much better. If I can get varSHA1Hash.UpdateStream to read in data several MB at a time instead of in 8 KB buffers (which the UpdateStream procedure uses, if you read the library code), it will be faster. As it stands, my understanding is that varSHA1Hash.UpdateStream(SourceData, SourceData.Size) basically uses the entire size of the file being read as the buffer?
If it helps, here is the UpdateStream procedure from dcpcrypt2.pas:
procedure TDCP_hash.UpdateStream(Stream: TStream; Size: longword);
var
  Buffer: array[0..8191] of byte;
  i, read: integer;
begin
  dcpFillChar(Buffer, SizeOf(Buffer), 0);
  for i := 1 to (Size div SizeOf(Buffer)) do
  begin
    read := Stream.Read(Buffer, SizeOf(Buffer));
    Update(Buffer, read);
  end;
  if (Size mod SizeOf(Buffer)) <> 0 then
  begin
    read := Stream.Read(Buffer, Size mod SizeOf(Buffer));
    Update(Buffer, read);
  end;
end;
I have also looked at some other libraries, such as the Delphi Encryption Compendium (http://home.netsurf.de/wolfgang.ehrhardt/crchash_en.html) and the Wolfgang Ehrhardt library (http://www.torry.net/pages.php?id=519#939342), and also the one that is included with DoubleCommander, but for various reasons (simplicity being one) I am trying to do this using DCPCrypt.
To answer your question: you cannot pass a different size, but you can change the array size in dcpcrypt2.pas in the method you mentioned and recompile DCPCrypt; it is open source, after all.
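Concretely, the change is just the bound of the local Buffer array in TDCP_hash.UpdateStream (the routine quoted above); an edited version might look like the sketch below. The 64 KB size is an arbitrary choice of mine, and since Buffer is a local variable it lives on the stack, so don't make it enormous:
procedure TDCP_hash.UpdateStream(Stream: TStream; Size: longword);
var
  Buffer: array[0..65535] of byte; // 64 KB instead of the original 8 KB
  i, read: integer;
begin
  dcpFillChar(Buffer, SizeOf(Buffer), 0);
  for i := 1 to (Size div SizeOf(Buffer)) do
  begin
    read := Stream.Read(Buffer, SizeOf(Buffer));
    Update(Buffer, read);
  end;
  if (Size mod SizeOf(Buffer)) <> 0 then
  begin
    read := Stream.Read(Buffer, Size mod SizeOf(Buffer));
    Update(Buffer, read);
  end;
end;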
But this will not help much, because the sha1 unit of FPC is not faster because of a larger buffer size, but because of a faster implementation of the SHA-1 algorithm: it makes use of compiler intrinsics to rotate values, which is a heavily used operation in the SHA-1 algorithm.
Just run the following program with different numerical command line parameters (e.g. 8192 and 8388608):
uses
  sysutils, sha1;
begin
  writeln(SHA1Print(SHA1File('bigfile', StrToInt(paramstr(1)))));
end.
At least on my PC it makes no difference whether the buffer is 8 KB or 8 MB. If you use lower values like 1024, you will see a slight slowdown (10-20%).