Put Hex Into .bin File - binary

How would you stick a string of hexcode into a .bin file? Like this, \x45\x67\x89 for example. I've seen the long examples where you use bash to strip it then add it to the .bin, but there must be a quicker and simpler way?
Also, I am not too familiar with .bin's, are they a program in themselves?

printf is function wide supported function. C, cpp, php, python, bash...
so classic implementation in C would be:
FILE *fp =fopen('binfilename.bin', 'w');
fprintf(fp, "\x45\x67\x89"); fclose(fp);
all other languages have similar usage.
you mention bash, and i think there is no simpler way than bash itself:
printf "\x45\x67\x89" > binfilename.bin
Every file is binary file. If it contains just printable bytes, we call it textual file. If it is generated by compiler and have bytes meaningfull to cpu, not to human, than we say it is 'binary', program. But both textual and binary contains bytes and are binaries. Difference is after, when we/some app interprets it's contents.

Related

Pari/Gp directing output

Is there an easy convenient way to direct output in Pari/GP to file? My aim is to get the full decimal expansion of 2^400000-1 either on screen or in a text file?
(23:37) gp > 2^400000-1
%947 = 996014342993......(4438 digits)......609762267975[+++]
GP terminal output gives this, which is not the goal. Basic output re-direction does not work either. Any ideas? Thanks.
(23:38) gp > 2^400000-1 > output.txt
There is a manual online, it does not say much about the output, except for the variable TeXstyle. I am unsure how to work with this though.
Quick and easy is to just do print(2^400000-1) and then you can cut+paste. Otherwise write(filename, 2^400000-1) if you want in a file.
Some other possibilities:
writebin(filename,2^400000-1) writes the object binary structure in a file: this is faster than traditional output (which implies a binary to decimal conversion), and loading it into another session will be faster as well. This is useful for a huge atomic write.
C-style output: fileopen, then successive filewrite allows many writes to a file referenced by a descriptor (which avoids re-opening / flushing / closing the file after each write). This is useful for a large write operation done through many tiny writes to a given file, e.g, character by character.

Reading the wrong number of bytes from a binary file

I have the following code:
set myfile "the path to my file"
set fsize [file size $myfile]
set fp [open $myfile r]
fconfigure $fp -translation binary
set data [read $fp $fsize]
close $fp
puts $fsize
puts [string bytelength $data]
And it shows that the bytes read are different from the bytes requested. The bytes requested match what the filesystem shows; the actual bytes read are 22% more (requested 29300, got 35832). I tested this on Windows, with Tcl 8.6.
Use string length. Don't use string bytelength. It gives the “wrong” answers, or rather it answers a question you probably don't want to ask.
More Depth
The string bytelength command returns the length in bytes of the data in Tcl's internal almost-UTF-8 encoding. If you're not working with Tcl's C API directly, you really have no sensible use for that value, and C code is actually pretty able to get the value without that command. For ASCII text, the length and the byte-length are the same, but for binary data or text with NULs or characters greater than U+00007F (the Unicode character that is equivalent to ASCII DEL), the values will differ. By contrast, the string length command knows how to handle binary data correctly, and will report the number of bytes in the byte-string that you read in. We plan to deprecate the string bytelength command, as it turns out to be a bug in someone's code almost every time they use it.
(I'm guessing that your input data actually has 6532 bytes outside the range 1–127 in it; the other bytes internally use a two-byte representation in almost-UTF-8. Fortunately, Tcl doesn't actually convert into that format until it needs to, and instead uses a compact array of bytes in this case; you're forcing it by asking for the string bytelength.)
Background Information
The question of “how much memory is actually being used by Tcl to read this data” is quite hard to answer, because Tcl will internally mutate data to hold it in the form that is most efficient for the operations you are applying to it. Because Tcl's internal types are all precisely transparent (i.e., conversions to and from them don't lose information) we deliberately don't talk about them much except from an optimisation perspective; as a programmer, you're supposed to pretend that Tcl has no types other than string of unicode characters.
You can peel the veil back a bit with the tcl::unsupported::representation command (introduced in 8.6). Don't use the types for decisions on what to do in your code, as that is really not something guaranteed by the language, but it does let you see a lot more about what is really going on under the covers. Just remember, the values that you see are not the same as the values that Tcl's implementation thinks about. Thinking about the values that you see (without that magic command) will keep you thinking about things that it is correct to write.

What does "the composition of UNIX byte streams" mean?

In the opening page of the book of "Lisp In Small Pieces", there is a paragraph goes like this:
Based on the idea of "function", an idea that has matured over
several centuries of mathematical research, applicative languages are
omnipresent in computing; they appear in various forms, such as the
composition of Un*x byte streams, the extension language for the Emacs
editor, as well as other scripting languages.
Can anyone elaborate a bit on "the composition of unix byte streams"? What does it mean? and how it is related to applicative/functional programming?
Thanks,
/bruin
My guess is that this is a reference to something like a pipe under linux.
cal | wc
the symbol | it's what invokes a pipe between 2 applications, a pipe is a feature provided by the kernel so you can use pipes where the applications are written using this kind of kernel APIs.
In this example cal is just the utility that prints a calendar, wc is an utility that counts words, rows and columns in the input that you pass to it, in this case the input is the result of piping cal to wc which makes things easier for you because it's more functional, you only care about what each applications does, you don't care, for example, about what is the name of the argument or where to allocate a temporary file to store the input/output in between.
Without the pipes you should do something like
cal > temp.txt
wc temp.txt
rm temp.xt
to obtain pretty much the same information. Also this second solution could possibly generate problems, for example what if temp.txt already exists ? Following what kind of rationale you will tell to your script to pick a name for your temporary file ? What if another process modifies your file in between the 2 calls to cal and wc ?

how to get the real file contents using TFilestream?

i try to get the file contents using TFilestream:
procedure ShowFileCont(myfile : string);
var
tr : string;
fs : TFileStream;
Begin
Fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
SetLength(tr, Fs.Size);
Fs.Read(tr[1], Fs.Size);
Showmessage(tr);
Fs.Free;
end;
I do a little text file with contents only:
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
And save this file (using AkelPad) with 1251 (ansi) codepege
Save with 65001 (UTF8) codepage.
these to files has different size but there contents is equal - i oped them both in notepad and they both has the same contents
But when i run ShowFileCont proc it shows to me different results:
aaaaaaaJ?ЊT?8?V?"?A?aaaaaaa
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
Questions:
how to get the real file contents using TFilestream?
How to explain that these 2 files has different size but the content (in notepad) is equeal?
Add: Sorry, i didn't say that i use Lazarus FPC and string = utf8string
Why do the files have different size?
Because they use different encodings. The 1251 encoding maps each character to a single byte. But UTF-8 uses variable numbers of bytes for each character.
How do I get the true file contents?
You need to use a string type that matches the encoding used in the file. So, for example, if the content is UTF-8 encoded, which is the best choice, then you load the content into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded. In which case the code in the question is what you need.
Loading an MBCS encoded file with a code page of 1251, say, is more tricky. You can load that into an AnsiString variable and so long as your system's locale is 1251 then any conversions will be performed correctly.
But the code will behave differently when run on a machine with a different locale. And if you wanted to load text using different MBCS encodings, for example 1252, then you cannot use this approach. You would need to load into a byte array and then convert from 1252, say, to UTF-8 so that you could then store that UTF-8 in a string variable.
In order to do that you can use the LConvEncoding unit from LCL. For example, you can use CP1251ToUTF8, CP1252ToUTF8 etc. to convert from MBCS to UTF-8.
How can I determine from the file what encoding is used?
You cannot. You can make a guess that will be accurate in many cases. But in general, it is simply impossible to identify the encoding of an array of bytes that is meant to represent text.
It is sometimes possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text. And so you can rule out such files. But for encodings like 1251, 1252 etc. then any byte stream is valid. There's simply no way for you to tell 1251 encoded streams apart from 1252 encoded streams with 100% accuracy.
The LConvEncoding unit has GuessEncoding which sounds like it may be of some use.
Their contents are obviously not equal. You can see for yourself that the file sizes are different. Things of different size are never equal.
Your files might appear equal in Notepad because Notepad knows how to recognize certain character encodings. You saved your file two different ways. One way used an encoding that assigns one byte to each of 256 possible values. The other way uses an encoding that assigns between one and six bytes to each of more than 10,000 possible values. Some of the characters you saved require more than one byte, which explains why one version of the file is bigger than the other.
TFileStream doesn't pay attention to any of that. It just deals with bytes. Depending on your Delphi version, your string variable may or may not pay attention to encodings. Prior to Delphi 2009, string stored one byte per character. As of Delphi 2009, string uses two bytes per character, so your SetLength call is wrong, and everything after that is pointless to investigate much further.
With one byte per character, your ShowMessage call is not going to interpret the string as UTF-8-encoded. Instead, it will interpret your string using whatever your system code page is. If you know that the string you've read is encoded with UTF-8, then you'll want to convert it to UTF-16 prior to display by calling UTF8Decode. That will return a WideString, and you can use any number of functions to display it, such as MessageBoxW. If you have Delphi 2009 or later, then the compiler will insert conversion code for you automatically, if you've used Utf8String instead of string.

Any good tool to convert HTML entities in HTML documents to plain UTF characters?

I have many HTML documents containing many HTML entities of Unicode code point representation, e.g. بروح
Is there a good tool to convert HTML entities in multiple HTML documents to plain UTF-8/UTF-16/UTF-32 characters?
I want an offline converter tool that can do a batch job for this purpose.
I don't know of such a tool, but you could easily write one. This C# code for example would convert all html files in the current folder:
foreach (string name in Directory.GetFiles(".", "*.html")) {
string s = File.ReadAllText(name);
s = Regex.Replace(
s,
#"&#(\d+);",
m => ((char)Int32.Parse(m.Groups[1].Value)).ToString()
);
File.WriteAllText(name, s);
}
The GNU utility "recode" will do this, with the invocation
recode HTML..UTF-16LE < old.html > new.html
(or UTF-16BE, of course.)
http://ftp.gnu.org/gnu/recode/recode-3.6.tar.gz
It's use of HTML as a character set is a bit of a hack and is treated as either ASCII or LATIN-1, when it should be treated as a "surface" for any character set. If there are any UTF-8 characters, it can break, so I'm now withdrawing my recommendation. Use the first.
(You might expect recode UTF-8..HTML,HTML..UTF-16LE to work, but this first encodes the ampersands...)