I have two BMP files: a Windows screenshot and a file generated on Linux using GIMP.
What I noticed is that all the data in the headers is stored in big-endian format.
The biWidth, biHeight, and biPlanes fields of the DIB header are all in big endian, and so is the size of the BMP file in bytes (the second field of the Bitmap File Header). This contradicts Wikipedia, which says: "All of the integer values are stored in little-endian format".
I looked into GIMP's source code and I found a function that converts data from little to big endian:
https://git.gnome.org/browse/gimp/tree/plug-ins/file-bmp/bmp-write.c#n81
That FromL function is used to write the file size in bytes in the Bitmap File Header:
https://git.gnome.org/browse/gimp/tree/plug-ins/file-bmp/bmp-write.c#n431
So everything is in big endian; the question is why.
Why would one want to convert to big endian on write and back from big to little endian on read, when one could simply read and write that data in little endian?
What am I missing?
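For reference, here is a minimal sketch (in Python, with a placeholder filename) that reads those same header fields as little-endian values, using the field offsets given in the Wikipedia BMP article:

import struct

# Minimal sketch: parse the BMP headers as little-endian.
# Offsets follow the Wikipedia BMP article; "input.bmp" is a placeholder name.
with open("input.bmp", "rb") as f:
    header = f.read(28)

# '<' forces little-endian interpretation regardless of the host byte order.
file_size = struct.unpack("<I", header[2:6])[0]       # file size in bytes, offset 2
width, height = struct.unpack("<ii", header[18:26])   # biWidth, biHeight, offsets 18/22
planes = struct.unpack("<H", header[26:28])[0]        # biPlanes, offset 26

print(file_size, width, height, planes)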
Related
When you get down to the bare metal, all data is stored in bits, which are binary (1 or 0). However, I sometimes see terms like "binary file", which implies the existence of files that aren't binary. The same goes for things like base64 encoding, which Wikipedia describes as a "binary-to-text encoding scheme". But if I'm not mistaken, text is also stored in a binary format on the hardware, so isn't base64 encoding ultimately converting binary to binary? Is there some other definition of "binary" I am unaware of?
You are right that, deep down, everything is a binary file. However, at its base, a binary file is intended to be read as an array of bytes, where each byte has a value between 0 and 255. A text file is intended to be read as an array of characters.
When, in Python, I open a file with open("myfile", "r"), I am telling it that I expect the underlying file to contain characters, and that Python should do the necessary processing to give me characters. It may convert multiple bytes into a single character. It may canonicalize all possible newline combinations into a single newline character. Some characters have multiple byte representations, but all will give me the same character.
When I open a file with open("myfile", "rb"), I literally want the file read byte by byte, with no interpretation of what it is seeing.
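A quick sketch of that difference in Python (the filename is just a placeholder):

with open("myfile", "rb") as f:
    raw = f.read()     # bytes, exactly as stored on disk
with open("myfile", "r", encoding="utf-8") as f:
    text = f.read()    # str: bytes decoded into characters, newlines normalized

print(type(raw))       # <class 'bytes'>
print(type(text))      # <class 'str'>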
I am reading a file in ActionScript, and that kind of file may be in ASCII or binary format.
How can I check which format is used when reading?
Regards.
You could read the byte values from the file and try to make a guess based on the values you read. If you read a lot of bytes with values from 65-128 (approximately), it's very likely an ASCII format. Check www.asciitable.com and pick the ASCII codes that you expect will and will not appear in an ASCII/binary file.
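A rough sketch of that heuristic (in Python rather than ActionScript; the byte ranges and the 0.95 cutoff are arbitrary choices, not a standard):

def looks_like_text(path, sample_size=1024):
    with open(path, "rb") as f:
        data = f.read(sample_size)
    if not data:
        return True                     # treat an empty file as text
    if 0 in data:
        return False                    # NUL bytes almost never appear in text
    # count printable ASCII plus tab, newline and carriage return
    textlike = sum(1 for b in data if 32 <= b <= 126 or b in (9, 10, 13))
    return textlike / len(data) > 0.95  # arbitrary cutoff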
I try to get the file contents using TFileStream:
procedure ShowFileCont(myfile: string);
var
  tr: string;
  fs: TFileStream;
begin
  fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
  SetLength(tr, fs.Size);
  fs.Read(tr[1], fs.Size);
  ShowMessage(tr);
  fs.Free;
end;
I made a little text file containing only:
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
And I saved this file (using AkelPad) with the 1251 (ANSI) codepage.
Then I saved it again with the 65001 (UTF-8) codepage.
These two files have different sizes, but their contents are equal: I opened them both in Notepad and they both show the same text.
But when I run the ShowFileCont procedure, it shows me different results:
aaaaaaaJ?ЊT?8?V?"?A?aaaaaaa
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
Questions:
How do I get the real file contents using TFileStream?
How to explain that these two files have different sizes but their content (in Notepad) is equal?
Added: Sorry, I didn't say that I use Lazarus FPC, where string = utf8string.
Why do the files have different sizes?
Because they use different encodings. The 1251 encoding maps each character to a single byte, but UTF-8 uses a variable number of bytes per character.
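You can see the size difference directly, for example in Python:

s = "Привет aaaa"                # a sample mixing Cyrillic and Latin letters
print(len(s.encode("cp1251")))   # 11 bytes: one byte per character
print(len(s.encode("utf-8")))    # 17 bytes: each Cyrillic letter takes two bytes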
How do I get the true file contents?
You need to use a string type that matches the encoding used in the file. So, for example, if the content is UTF-8 encoded (which is the best choice), then you load the content into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded, in which case the code in the question is what you need.
Loading an MBCS-encoded file with a code page of 1251, say, is trickier. You can load that into an AnsiString variable, and so long as your system's locale is 1251, any conversions will be performed correctly.
But the code will behave differently when run on a machine with a different locale. And if you want to load text using other MBCS encodings, for example 1252, then you cannot use this approach. You would need to load into a byte array and then convert from 1252, say, to UTF-8 so that you could then store that UTF-8 in a string variable.
In order to do that you can use the LConvEncoding unit from LCL. For example, you can use CP1251ToUTF8, CP1252ToUTF8 etc. to convert from MBCS to UTF-8.
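The analogous conversion expressed in Python, purely for illustration (the real LCL route is the CP1251ToUTF8 function mentioned above; the filename is a placeholder):

with open("myfile", "rb") as f:
    raw = f.read()                                 # load into a byte array
utf8_bytes = raw.decode("cp1251").encode("utf-8")  # reinterpret as 1251, re-encode as UTF-8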
How can I determine from the file what encoding is used?
You cannot. You can make a guess that will be accurate in many cases. But in general, it is simply impossible to identify the encoding of an array of bytes that is meant to represent text.
It is sometimes possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text, and so you can rule out such files. But for encodings like 1251, 1252, etc., any byte stream is valid. There's simply no way to tell 1251-encoded streams apart from 1252-encoded streams with 100% accuracy.
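For example, checking whether a byte stream could be UTF-8 takes only a few lines of Python:

def could_be_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")   # strict decoding raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False

A 1251 decode, by contrast, never fails, which is exactly why such encodings cannot be ruled out.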
The LConvEncoding unit has GuessEncoding which sounds like it may be of some use.
Their contents are obviously not equal. You can see for yourself that the file sizes are different. Things of different size are never equal.
Your files might appear equal in Notepad because Notepad knows how to recognize certain character encodings. You saved your file two different ways. One way used an encoding that assigns one byte to each of 256 possible values. The other way used an encoding (UTF-8) that assigns between one and four bytes to each of more than a million possible values. Some of the characters you saved require more than one byte, which explains why one version of the file is bigger than the other.
TFileStream doesn't pay attention to any of that. It just deals with bytes. Depending on your Delphi version, your string variable may or may not pay attention to encodings. Prior to Delphi 2009, string stored one byte per character. As of Delphi 2009, string uses two bytes per character, so your SetLength call is wrong, and everything after that is pointless to investigate much further.
With one byte per character, your ShowMessage call is not going to interpret the string as UTF-8-encoded. Instead, it will interpret your string using whatever your system code page is. If you know that the string you've read is encoded with UTF-8, then you'll want to convert it to UTF-16 prior to display by calling UTF8Decode. That will return a WideString, and you can use any number of functions to display it, such as MessageBoxW. If you have Delphi 2009 or later, then the compiler will insert conversion code for you automatically, if you've used Utf8String instead of string.
There is a MySQL backup file which is huge - about 3 GB. One table in it has a LONGBLOB column that stores JPEG image data.
The file imports successfully if done from MySQL Workbench - Data Import/Restore.
I need to open this file and extract the first few lines (about two rows of INSERTs of the table with the image data) so that I can test if another program can import this data into another MySQL database.
I tried opening the file with EmEditor (which is good at opening large files), selecting everything up to the first INSERT statement of the script (up to about line 25, because the table in question is the first table in the backup), and then pasting the selection into a new file.
Here comes the problem:
However, this messes up the encoding (even though I save as UTF-8). I realize this when I try to import (restore) the new file (again using MySQL Workbench) into a MySQL database: the restore goes ahead without errors, but the JPEG images in the blob column end up destroyed/corrupted.
My guess is that the encoding differs between the original file and the new file.
EmEditor does not show the encoding of the original file, but there is an option to detect it, and it detects 'UTF8 Unsigned'. When saving, I save as UTF-8. I also tried saving as ANSI, ISO-8859 (the Windows default), etc., but I get the same result every time.
Do you have any solution for this particular problem? That is, I want to cut only the first few lines of the huge backup file and save them to a new file, keeping the encoding the same, so that the images (blobs) are not changed. Is there any way this can be done with EmEditor (or is the Cut-Paste approach itself wrong)? Is there any specialized software that can do this? How can I diagnose what is going wrong here?
Thanks for any responses.
this messes up the encoding (even though I save as UTF-8)
UTF-8 is not a good choice for arbitrary binary data. There are many sequences of high bytes which are not valid in UTF-8, so you will mangle them at some point during the load-alter-save process.
If you load the file using an encoding that maps every single byte to a unique character, and re-save the file using that same encoding, you should preserve the original content(*). ISO-8859-1 is the encoding usually chosen for this purpose, since it simply maps each byte 0..0xFF to the Unicode code point with the same number.
(*: assuming the editor is binary-safe with regard to other tricky points like nulls, \n/\r and other control characters... I believe EmEditor can be.)
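That round-trip property is easy to verify, for instance in Python (the filename is a placeholder):

with open("dump.sql", "rb") as f:
    data = f.read()
text = data.decode("latin-1")          # every byte 0..0xFF maps to one code point
assert text.encode("latin-1") == data  # lossless round trip, no bytes mangled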
When opening the original file in EmEditor, try selecting the encoding Binary (ASCII View). Binary (ASCII View) will, as bobince said, map each byte to a unique character and preserve that when you save the file. I think this should fix your problem.
Many languages have functions which only process "plaintext", not binary. Does this mean that only characters within the ASCII range will be allowed?
Binary is just a series of bytes, and isn't plaintext similar - just a series of bytes interpreted as characters? So, can plaintext store the same data formats / protocols as binary?
A plain text file is human-readable; a binary file is usually unreadable by a human, since it is composed of both printable and non-printable characters.
Try to open a JPEG file with a text editor (e.g. Notepad or Vim) and you'll understand what I mean.
A binary file is usually constructed in a way that optimizes speed, since no parsing is needed.
A plain text file is editable by hand; a binary file is not.
"Plaintext" can have several meanings.
The one most useful in this context is that it is merely a binary file which is organized in byte sequences that a particular computer system can translate into a finite set of what it considers "text" characters.
A second meaning, somewhat connected, is a restriction that said system should display these "text characters" as symbols readable by a human as members of a recognizable alphabet. Often, the unwritten implication is that the translation mechanism is ASCII.
A third, even more restrictive meaning is that this system must be a "simple" text editor/viewer, usually implying ASCII encoding. But, really, there is VERY little difference between you, the human, reading text encoded in some funky format and displayed by a proprietary program, and the vi text editor reading an ASCII-encoded file.
Within a programming context, your programming environment (comprising the OS + system APIs + your language's capabilities) defines both a set of "text" characters and a set of encodings it is able to read and convert into those "text" characters. Please note that this does not necessarily imply ASCII, English, or 8 bits - as an example, Perl can natively read and use the full Unicode set of "characters".
To answer your specific question, you can definitely use "character" strings to transmit arbitrary byte sequences, with the caveat that string termination conventions must apply.
The problem is that the functions that already exist to "process character data" would probably not have any useful functionality to deal with your binary data.
One thing it often means is that the language might feel free to interpret certain control characters, such as the values 10 or 13, as logical line terminators. In other words, an output operation might automagically append these characters at the end, and an input operation might strip them from the input (and/or terminate reading there).
In contrast, language I/O operations that advertise working on "binary" data will usually include an input parameter for the length of data to operate on, since there is no other way (short of reading past end of file) to know when it is done.
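Python's text mode shows both behaviors (the filename is a placeholder):

with open("demo.txt", "wb") as f:    # binary write: bytes go out verbatim
    f.write(b"one\r\ntwo\rthree\n")

with open("demo.txt", "r") as f:     # text mode normalizes line endings on read
    print(repr(f.read()))            # 'one\ntwo\nthree\n'

with open("demo.txt", "rb") as f:    # binary read: no interpretation at all
    print(f.read())                  # b'one\r\ntwo\rthree\n'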
Generally, it depends on the language/environment/functionality.
Binary data is always that: binary. It is transferred without modification.
"Plain text" mode may mean one or more of the following things:
the stream of bytes is split into lines. The line delimiters are \r, \n, \r\n, or \n\r. Sometimes the choice is OS-dependent (*nix likes \n, while Windows likes \r\n). The line endings may be adjusted for the reading application
character encoding may be adjusted. The environment might detect and/or convert the source encoding into the encoding the application expects
probably some other conversions should be added to this list, but I can't think of any more at this moment
Technically, nothing. Plain text is a form of binary data. However, a major difference is how values are stored. Think of how an integer might be stored. In binary data it would use a two's complement format, probably taking 32 bits of space. In text format, a number would instead be stored as a series of Unicode digits. So the number 50 would be stored as 0x32 (padded to take up 32 bits) in binary, but as '5' '0' in plain text.
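In Python terms, for illustration:

import struct

n = 50
print(struct.pack("<i", n))    # b'2\x00\x00\x00': two's complement, 32 bits (0x32 plus padding)
print(str(n).encode("ascii"))  # b'50': the digit characters 0x35 and 0x30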