I'm having an enforced introduction to IDL while trying to debug some old code.
I have a binary image file that has an ASCII header (it's a THEMIS IR BTR image of Mars, if that is of interest). The code opens the file as unit 1 using OPENR, then reads the first 256 bytes of it using ASSOC(1,BYTARR(256)). That returns 256 decimal byte values, but most of them are too high or too low to be alphanumeric ASCII codes, and they do not match the header that I know is in the file.
One thing that may help with diagnostics: the original file is gzipped. If I try to open it directly (using less, for example) it allows me to read the header. But if I unzip it first (gzip -c filename.IMG.gz > filename.IMG) and then try to read it again I get binary gobbledegook (less gives me a warning before opening: "filename.IMG may be a binary file. See it anyway?").
Any suggestions?
Here's the IDL code:
CLOSE,1
OPENR,1,FILENAME
A = ASSOC(1,BYTARR(256))
B = A[0]
print,'B - ',B
H = STRING(B)
print,'H - ',H
And this is what it gives me:
B - 31 139 8 8 7 17 238 79 0 3 ... (and on for 256 characters)
H - [Some weird symbol]
I've tried it on a purely ASCII test file and it works as expected.
31 139 8 (0x1f 0x8b 0x08) is the beginning of a gzip header: two magic bytes followed by the compression method, where 8 means "deflate".
http://www.gzip.org/zlib/rfc-gzip.html#file-format
So yes, the file looks like it needs to be decompressed first.
Try decompressing the file with gunzip, and check the first bytes again. If they are still 31 139 8, the file has been compressed twice.
Otherwise the file is now fully decompressed, and the remaining question is why the uncompressed data isn't being decoded as expected.
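The check described above can be sketched in Python. This is only an illustration under assumptions: the helper name peel_gzip_layers and the sample header bytes are made up for the example and do not come from the THEMIS file.

```python
import gzip
from typing import Tuple

# Decimal 31 139 8: the two gzip magic bytes plus the "deflate" method byte.
GZIP_MAGIC = b"\x1f\x8b\x08"


def peel_gzip_layers(data: bytes, max_layers: int = 5) -> Tuple[bytes, int]:
    """Decompress repeatedly while the data still starts with a gzip header.

    Returns the innermost payload and the number of gzip layers removed.
    """
    layers = 0
    while data[:3] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers


# Hypothetical stand-in for an ASCII header, compressed twice on purpose.
header = b"SAMPLE ASCII HEADER\r\n"
twice = gzip.compress(gzip.compress(header))
print(list(twice[:3]))  # the first three bytes as decimal values
payload, layers = peel_gzip_layers(twice)
print(layers, payload)
```

If the layer count comes back greater than one, the file was compressed more than once; if it is zero, the data never had a gzip header to begin with.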
Try the COMPRESS keyword to OPENR:
openr, 1, filename, /compress
The COMPRESS keyword tells IDL that the file is gzip-compressed, and it works for both reading and writing compressed files.
The SystemVerilog code below is a single-file testbench which reads a binary file into a memory using $fread, then prints the memory contents. The binary file is 16 bytes and a view of it is included below (this is what I expect the SystemVerilog code to print).
The printed output matches what I expect for the first 6 bytes (0-5). At that point the expected output is 0x80, but the printed output is a sequence of 3 bytes starting with 0xef which are not in the stimulus file. After those 3 bytes the output matches the stimulus again. It seems that the error occurs whenever bit 7 of the byte read is 1. It is almost as if the data is being treated as signed, but it is not; it's binary data printed as hex. The memory is defined as type logic, which is unsigned.
This is similar to a question/answer in this post:
Read binary file data in Verilog into 2D Array.
However, my code does not have the same issue (I use "rb" in the $fopen statement), so that solution does not apply here.
The SystemVerilog spec 1800-2012 states in section 21.3.4.4, Reading binary data, that $fread can be used to read a binary file, and goes on to say how. I believe this example complies with that section.
The code is posted on EDA Playground so that users can see it and run it.
https://www.edaplayground.com/x/5wzA
You need a login to run it and download. The login is free. It provides access to full cloud-based versions of the industry standard tools for HDL simulation.
I also tried running 3 different simulators on EDA Playground. They all produce the same result.
I have tried rearranging the stim.bin file so that the 0x80 value occurs at the beginning of the file rather than in the middle. In that case the error also occurs at the beginning of the testbench's printed output.
Maybe the SystemVerilog code is fine and the problem is the binary file? I have provided a screenshot of what emacs hexl mode shows for its contents. I also viewed it in another viewer and it looked the same. You can download it when running on EDA Playground to examine it in another editor. The binary file was generated by GNU Octave.
I would prefer a solution that uses SystemVerilog $fread rather than something else, so that I can debug the original rather than work around it (learning). This will be developed into a SystemVerilog testbench which applies stimulus read from a binary file generated in Octave/Matlab to a SystemVerilog DUT. Binary file I/O is preferred because of the file access speed.
Why does the SystemVerilog testbench print 0xef rather than 0x80 for mem[6]?
module tb();
  // file descriptors
  int read_file_descriptor;
  // memory
  logic [7:0] mem [15:0];
  // ---------------------------------------------------------------------------
  // Open the file
  // ---------------------------------------------------------------------------
  task open_file();
    $display("Opening file");
    read_file_descriptor=$fopen("stim.bin","rb");
  endtask
  // ---------------------------------------------------------------------------
  // Read the contents of file descriptor
  // ---------------------------------------------------------------------------
  task readBinFile2Mem ();
    int n_Temp;
    n_Temp = $fread(mem, read_file_descriptor);
    $display("n_Temp = %0d",n_Temp);
  endtask
  // ---------------------------------------------------------------------------
  // Close the file
  // ---------------------------------------------------------------------------
  task close_file();
    $display("Closing the file");
    $fclose(read_file_descriptor);
  endtask
  // ---------------------------------------------------------------------------
  // Shut down testbench
  // ---------------------------------------------------------------------------
  task shut_down();
    $stop;
  endtask
  // ---------------------------------------------------------------------------
  // Print memory contents
  // ---------------------------------------------------------------------------
  task printMem();
    foreach(mem[i])
      $display("mem[%0d] = %h",i,mem[i]);
  endtask
  // ---------------------------------------------------------------------------
  // Main execution loop
  // ---------------------------------------------------------------------------
  initial
  begin :initial_block
    open_file;
    readBinFile2Mem;
    close_file;
    printMem;
    shut_down;
  end :initial_block
endmodule
Binary Stimulus File:
Actual output:
Opening file
n_Temp = 16
Closing the file
mem[15] = 01
mem[14] = 00
mem[13] = 50
mem[12] = 60
mem[11] = 71
mem[10] = 72
mem[9] = 73
mem[8] = bd
mem[7] = bf
mem[6] = ef
mem[5] = 73
mem[4] = 72
mem[3] = 71
mem[2] = 60
mem[1] = 50
mem[0] = 00
Update:
An experiment was run to test whether the binary file is being modified during the process of uploading to EDA Playground. There is no SystemVerilog code involved in these steps; it's just a file upload/download.
Steps:
(Used https://hexed.it/ to create and view the binary file)
Create/save binary file with the hex pattern 80 00 80 00 80 00 80 00
Create new playground
Upload new created binary file to the new playground
Check the 'download files after run' box on the playground
Save playground
Run playground
Save/unzip the results from the playground run
View the binary file; in my case it has been modified during the process of upload/download. A screenshot of the result is shown below:
This experiment was conducted on two different Windows workstations.
Based on these results and the comments I am going to close this issue, with the disposition that this is not a SystemVerilog issue, but is related to upload/download of binary files to EDA Playground. Thanks to those who commented.
The unexpected output produced by the testbench is due to modifications that occur to the binary stimulus file during/after upload to EDA Playground. The SystemVerilog testbench performs as intended and prints the contents of the binary file as it exists on the server.
This conclusion is based on community comments and experimental results which are provided at the end of the updated question. A detailed procedure is given so that others can repeat the experiment.
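One detail worth noting: the three unexpected bytes in the output, ef bf bd, are the UTF-8 encoding of U+FFFD, the Unicode replacement character. A plausible explanation, which is an assumption and not something the experiment confirms, is that somewhere in the upload path the file is decoded as UTF-8 text and written back out, so every byte that is not valid UTF-8 (such as a lone 0x80) is replaced. A minimal Python sketch of that effect, using the first eight bytes implied by the question:

```python
def lossy_utf8_round_trip(raw: bytes) -> bytes:
    """Decode bytes as UTF-8, replacing invalid sequences, then encode again."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# First bytes of the stimulus as the question describes them (0x80 at index 6).
original = bytes([0x00, 0x50, 0x60, 0x71, 0x72, 0x73, 0x80, 0x73])
mangled = lossy_utf8_round_trip(original)

print(original.hex(" "))  # 00 50 60 71 72 73 80 73
print(mangled.hex(" "))   # 00 50 60 71 72 73 ef bf bd 73
```

The mangled bytes line up with mem[0] through mem[9] in the printed output, and the file grows by two bytes for every byte that gets replaced.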
I have several huge (>2GB) JSON files that end in ,\n]. Here is my test file example, which is the last 25 characters of a 2 GB JSON file:
test.json
":{"value":false}}}}}},
]
I need to delete the ,\n and add back in the ] from the last three characters of the last line. The entire file is on three lines: both the front and end brackets are on their own line, and all the contents of the JSON array is on the second line.
I can't load the entire stream into memory to do something like:
string[0..-2]
because the file is way too large. I tried several approaches, including Ruby's:
chomp!(",\n]")
and UNIX's:
sed
both of which made no change to my JSON file. I viewed the last 25 characters by doing:
tail -c 25 filename.json
and also did:
ls -l
to verify that the byte size of the new and the old file versions were the same.
Can anyone help me understand why none of these approaches is working?
It's not necessary to read in the whole file if you're looking to make a surgical operation like this. Instead you can just overwrite the last few bytes in the file:
file = 'huge.json'
IO.write(file, "\n]\n", File.stat(file).size - 5)
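The key here is to write exactly as many bytes as you back-track from the end. If you write fewer, you'll also need to shorten the file, which you can do with truncate. The offset depends on the exact bytes at the end of your file, so check the tail first (for example with tail -c) and adjust it to match.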
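The same idea, with the tail verified before anything is written and the file shortened afterwards, might look like this in Python. It is a sketch under assumptions: the function name fix_json_tail is made up, and it assumes the file ends in ",\n]" with at most one trailing newline.

```python
import os


def fix_json_tail(path: str, bad_tail: bytes = b",\n]", good_tail: bytes = b"\n]") -> bool:
    """Replace a trailing bad_tail with good_tail, touching only the file's end.

    A single newline after bad_tail is tolerated and preserved.
    Returns True if the file was changed, False if the expected tail was absent.
    """
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)
        window = min(size, len(bad_tail) + 1)
        f.seek(size - window)
        tail = f.read(window)

        trailing = b"\n" if tail.endswith(bad_tail + b"\n") else b""
        if not tail.endswith(bad_tail + trailing):
            return False

        # Jump back to the start of the bad tail, overwrite it, and cut the
        # file off at the current position so no stale bytes remain.
        f.seek(size - len(bad_tail) - len(trailing))
        f.write(good_tail + trailing)
        f.truncate()
        return True
```

Only the last few bytes are read and written, so the size of the rest of the file does not matter.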
How can one detect the type of compression used on the file? (assuming that .zip, .gz, .xz or any other extension is not specified).
Is this information stored somewhere in the header of that file?
You can determine that it is likely to be one of those formats by looking at the first few bytes. You should then test to see if it really is one of those, using an integrity check from the associated utility for that format, or by actually proceeding to decompress.
You can find the header formats in the descriptions:
Zip (.zip) format description, starts with 0x50, 0x4b, 0x03, 0x04 (unless empty — then the last two are 0x05, 0x06 or 0x06, 0x06)
Gzip (.gz) format description, starts with 0x1f, 0x8b, 0x08
xz (.xz) format description, starts with 0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00
Others:
zlib (.zz) format description, starts with two bytes (in bits) 0aaa1000 bbbccccc, where ccccc is chosen so that the first byte times 256 plus the second byte, both read as unsigned integers, is a multiple of 31. E.g. 01111000 (bits) = 120, 10011100 (bits) = 156, and 120 * 256 + 156 = 30876, which is a multiple of 31 (31 * 996)
compress (.Z) starts with 0x1f, 0x9d
bzip2 (.bz2) starts with 0x42, 0x5a, 0x68
Zstandard (.zstd) format description, frame starts with a 4-byte magic number 0xFD2FB528 stored in little-endian format (so the bytes on disk are 0x28, 0xb5, 0x2f, 0xfd), a skippable frame starts with 0x184D2A5? (the question mark is any value from 0 to F), and a dictionary starts with 0xEC30A437.
A few more formats in the magic database from the file command
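The signatures listed above can be turned into a small lookup. The following Python sketch is only a first-pass guess, as the answer says: the function names are made up for the example, and a match should still be confirmed by actually decompressing, because ordinary data can start with the same bytes (the two-byte zlib test in particular is weak).

```python
from typing import Optional

# (prefix, name) pairs taken from the signatures listed above.
SIGNATURES = [
    (b"\x50\x4b\x03\x04", "zip"),
    (b"\x50\x4b\x05\x06", "zip (empty)"),
    (b"\x50\x4b\x06\x06", "zip (empty)"),
    (b"\x1f\x8b\x08", "gzip"),
    (b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
    (b"\x1f\x9d", "compress"),
    (b"\x42\x5a\x68", "bzip2"),
    (b"\x28\xb5\x2f\xfd", "zstd"),  # 0xFD2FB528 stored little-endian
]


def looks_like_zlib(head: bytes) -> bool:
    """Check the 0aaa1000 bbbccccc pattern and the multiple-of-31 rule."""
    if len(head) < 2:
        return False
    first, second = head[0], head[1]
    return (first & 0x0F) == 8 and (first >> 4) <= 7 and (first * 256 + second) % 31 == 0


def guess_compression(head: bytes) -> Optional[str]:
    """Guess a compression format from the first bytes, or return None."""
    for prefix, name in SIGNATURES:
        if head.startswith(prefix):
            return name
    if looks_like_zlib(head):
        return "zlib"
    return None
```

Passing the first 16 bytes or so of a file is enough, for example guess_compression(open(path, "rb").read(16)).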
If you're on a Linux box just use the 'file' command.
http://en.wikipedia.org/wiki/File_(command)
$ mv foo.zip dink
$ file dink
dink: gzip compressed data, from Unix, last modified: Sat Aug 6 08:08:57 2011,
max compression
$
As an alternative to inspecting the file header by hand, you could use some utility like TrID. The link points to the cross-platform command line version; for Windows there's a GUI, too.
If you want to determine an algorithm used to compress a linux kernel, there is a script for that, see this question and answer: https://unix.stackexchange.com/a/553192/264065
A simple implementation of gzip compression checking in Go:
import "bytes"

// IsGzipCompressed reports whether data starts with the gzip magic number
// (0x1f, 0x8b) and is at least as long as the fixed 10-byte gzip header.
func IsGzipCompressed(data []byte) bool {
	const gzipHeaderSize = 10
	if len(data) < gzipHeaderSize {
		return false
	}
	gzipHeaderMagicNumber := []byte{0x1f, 0x8b}
	return bytes.Equal(data[:2], gzipHeaderMagicNumber)
}
I am trying to get the file contents using TFileStream:
procedure ShowFileCont(myfile : string);
var
  tr : string;
  fs : TFileStream;
begin
  fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
  SetLength(tr, fs.Size);
  fs.Read(tr[1], fs.Size);
  ShowMessage(tr);
  fs.Free;
end;
I made a small text file containing only:
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
I saved this file twice using AkelPad: once with the 1251 (ANSI) code page and once with the 65001 (UTF-8) code page.
These two files have different sizes but their contents are equal: I opened them both in Notepad and they show the same contents.
But when I run the ShowFileCont procedure it shows me different results:
aaaaaaaJ?ЊT?8?V?"?A?aaaaaaa
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
Questions:
How do I get the real file contents using TFileStream?
How can these 2 files have different sizes when their content (in Notepad) is equal?
Add: Sorry, I didn't say that I use Lazarus/FPC and string = UTF8String.
Why do the files have different size?
Because they use different encodings. The 1251 encoding maps each character to a single byte. But UTF-8 uses variable numbers of bytes for each character.
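The size difference is easy to reproduce outside Pascal. Here is a small Python sketch, using a made-up sample string instead of the exact text from the question:

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text under each of the two encodings."""
    return {enc: len(text.encode(enc)) for enc in ("cp1251", "utf-8")}


sample = "aaaщдИaaa"  # hypothetical sample: six ASCII letters, three Cyrillic
print(encoded_sizes(sample))  # {'cp1251': 9, 'utf-8': 12}

# Reading the cp1251 bytes as if they were UTF-8 cannot work: the bytes above
# 0x7f do not form valid UTF-8 here, so they come back as replacement marks.
misread = sample.encode("cp1251").decode("utf-8", errors="replace")
print(misread)
```

The ASCII letters take one byte either way; only the Cyrillic letters grow to two bytes each in UTF-8, and those are also the characters that show up as question marks when the 1251 file is read into a UTF-8 string.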
How do I get the true file contents?
You need to use a string type that matches the encoding used in the file. So, for example, if the content is UTF-8 encoded, which is the best choice, then you load the content into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded. In which case the code in the question is what you need.
Loading an MBCS encoded file with a code page of 1251, say, is more tricky. You can load that into an AnsiString variable and so long as your system's locale is 1251 then any conversions will be performed correctly.
But the code will behave differently when run on a machine with a different locale. And if you wanted to load text using different MBCS encodings, for example 1252, then you cannot use this approach. You would need to load into a byte array and then convert from 1252, say, to UTF-8 so that you could then store that UTF-8 in a string variable.
In order to do that you can use the LConvEncoding unit from LCL. For example, you can use CP1251ToUTF8, CP1252ToUTF8 etc. to convert from MBCS to UTF-8.
How can I determine from the file what encoding is used?
You cannot. You can make a guess that will be accurate in many cases. But in general, it is simply impossible to identify the encoding of an array of bytes that is meant to represent text.
It is sometimes possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text, so a file that fails validation cannot be in that encoding. But for encodings like 1251, 1252 etc., any byte stream is valid. There's simply no way for you to tell 1251 encoded streams apart from 1252 encoded streams with 100% accuracy.
The LConvEncoding unit has GuessEncoding which sounds like it may be of some use.
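To illustrate the "rule out, but never prove" point in a language-neutral way, here is a short Python sketch. The helper name could_be_utf8 is made up, and the two sample bytes are chosen for the example:

```python
def could_be_utf8(data: bytes) -> bool:
    """Return False when the bytes are definitely not valid UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


data = b"\xf9\xe4"  # two bytes with no ASCII content

print(could_be_utf8(data))    # False: UTF-8 can be ruled out
print(data.decode("cp1251"))  # щд
print(data.decode("cp1252"))  # ùä
```

Both single-byte decodings succeed and both produce plausible text, so nothing in the bytes themselves says which one the author intended.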
Their contents are obviously not equal. You can see for yourself that the file sizes are different. Things of different size are never equal.
Your files might appear equal in Notepad because Notepad knows how to recognize certain character encodings. You saved your file two different ways. One way used an encoding that assigns one byte to each of 256 possible values. The other way uses an encoding that assigns between one and four bytes to each of more than a million possible values. Some of the characters you saved require more than one byte, which explains why one version of the file is bigger than the other.
TFileStream doesn't pay attention to any of that. It just deals with bytes. Depending on your Delphi version, your string variable may or may not pay attention to encodings. Prior to Delphi 2009, string stored one byte per character. As of Delphi 2009, string uses two bytes per character, so your SetLength call is wrong, and everything after that is pointless to investigate much further.
With one byte per character, your ShowMessage call is not going to interpret the string as UTF-8-encoded. Instead, it will interpret your string using whatever your system code page is. If you know that the string you've read is encoded with UTF-8, then you'll want to convert it to UTF-16 prior to display by calling UTF8Decode. That will return a WideString, and you can use any number of functions to display it, such as MessageBoxW. If you have Delphi 2009 or later, then the compiler will insert conversion code for you automatically, if you've used Utf8String instead of string.
I'm trying to return JSON content read from a MySQL server. This is supposed to be easy, but there is a 'weird' character that keeps appearing at the start of the content.
I have two pages for returning content:
kcb433.sytes.net/as/test.php?json=true&limit=6&input=d
this test.php is from a script written by Timothy Groves, which converts an array to json output
http://kcb433.sytes.net/k.php?k=4
this one is supposed to do the same
I tried to validate both at jsonformatter.curiousconcept.com, but only page 1 gets validated; for page 2 it says that it does not contain JSON data.
If accessed directly, both pages work without problems. So what is the difference, and why don't both get validated?
Then I found jsonformat.com and tried the same thing. Page 1 was OK and page 2 wasn't but, surprisingly, the data could be read. At a glance,
{"a":"b"}
may look good but there is a character in front.
According to a hex editor online, this is the value of the string above (instead of 9 values, there are 10):
-- 7B 22 61 22 3A 22 62 22 7D
The code to echo json in page 2 is:
header("Content-Type: application/json");
echo "{\"a\":\"b\"}";
Your k.php file has a BOM (byte order mark) signature at the start, which PHP sends to the client ahead of your output. Save k.php again as UTF-8 without BOM.
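For reference, the UTF-8 BOM is the three bytes EF BB BF, which a viewer that works in characters will show as a single invisible character (U+FEFF) in front of the JSON. A quick way to check for it and strip it on the receiving side, sketched in Python with a made-up helper name:

```python
import codecs
import json


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Simulated response body: a BOM followed by the JSON from the question.
body = codecs.BOM_UTF8 + b'{"a":"b"}'
print(body.hex(" "))  # ef bb bf 7b 22 61 22 3a 22 62 22 7d
print(json.loads(strip_utf8_bom(body).decode("utf-8")))  # {'a': 'b'}
```

Stripping on the client is only a workaround; re-saving the PHP file without the BOM removes the problem at the source.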