Excel does not display currency symbol (for example ¥) generated by my Tcl code

I am generating an MS Excel file containing currency values. If you open the file I generated (tinyurl.com/currencytestxls) in a text editor, it shows the correct symbol, but MS Excel does not display it. I am guessing there is some issue with the encoding. Any thoughts?
Here is my tcl code to generate the symbol:
set yen_val [format %c 165]

Firstly, this does produce a Yen symbol (I put the format string in double quotes here just for clarity):
format "%c" 165
You can then pass it around just fine. The problem is likely to come when you try to output it; when Tcl writes a string to the outside world (with the possible exception of the terminal on Windows, as that's tricky) it encodes that string into a definite byte sequence. The default encoding is the one reported by:
encoding system
But you can see what it is and change it for any channel; pass a new encoding name to set it:
fconfigure $theChannel -encoding $theEncoding
For example, on my system (which uses UTF-8, which can handle any character):
% fconfigure stdout -encoding
utf-8
% puts [format %c 165]
¥
If you use an encoding that cannot represent a particular character, the replacement character for that encoding is used instead; for many encodings, that's a “?”. When you are sending data to another program (including to a web server or to a browser over the internet) it is vital that both sides agree on what the encoding of the data is. Sometimes this agreement is by convention (e.g., the system encoding), sometimes it is defined by the protocol (HTTP headers have this clearly defined), and sometimes it is carried as explicit metadata in the content itself (e.g., an HTML meta tag).
If you're writing a CSV file to be ingested by Excel, use either the “unicode” or the “utf-8” encoding and make sure you put the byte-order mark in correctly. Tcl doesn't write BOMs automatically (because it's the wrong thing to do in some cases). To write a BOM, do this as the first thing when you start writing the file:
puts -nonewline $channel "\ufeff"
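Putting those pieces together, here is a minimal sketch (the file name and CSV contents are made up for illustration) of writing a UTF-8 CSV with a leading BOM so that Excel picks up the encoding:
set f [open "prices.csv" w]
fconfigure $f -encoding utf-8
# The byte-order mark must be the very first thing written to the file
puts -nonewline $f "\ufeff"
puts $f "item,price"
puts $f "widget,[format %c 165]500"   ;# ¥500
close $f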

Related

Reading the wrong number of bytes from a binary file

I have the following code:
set myfile "the path to my file"
set fsize [file size $myfile]
set fp [open $myfile r]
fconfigure $fp -translation binary
set data [read $fp $fsize]
close $fp
puts $fsize
puts [string bytelength $data]
And it shows that the bytes read are different from the bytes requested. The bytes requested match what the filesystem shows; the actual bytes read are 22% more (requested 29300, got 35832). I tested this on Windows, with Tcl 8.6.
Use string length. Don't use string bytelength. It gives the “wrong” answers, or rather it answers a question you probably don't want to ask.
More Depth
The string bytelength command returns the length in bytes of the data in Tcl's internal almost-UTF-8 encoding. If you're not working with Tcl's C API directly, you really have no sensible use for that value, and C code can get the value perfectly well without that command. For ASCII text the length and the byte-length are the same, but for binary data, or text containing NULs or characters above U+007F (ASCII DEL), the values will differ. By contrast, the string length command knows how to handle binary data correctly, and will report the number of bytes in the byte-string that you read in. We plan to deprecate the string bytelength command, as it turns out to be a bug in someone's code almost every time they use it.
(I'm guessing that your input data actually has 6532 bytes outside the range 1–127 in it; those bytes each use a two-byte representation in almost-UTF-8. Fortunately, Tcl doesn't actually convert into that format until it needs to, and instead uses a compact array of bytes in this case; you're forcing the conversion by asking for the string bytelength.)
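To see the difference concretely, here is a small illustration (not from the original post) using a three-byte binary string whose internal almost-UTF-8 form is larger:
set data [binary format c3 {0x41 0xA5 0x00}]   ;# bytes 41 A5 00
puts [string length $data]      ;# 3 -- the number of bytes actually in the data
puts [string bytelength $data]  ;# 5 -- 0xA5 and 0x00 take two bytes internally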
Background Information
The question of “how much memory is actually being used by Tcl to read this data” is quite hard to answer, because Tcl will internally mutate data to hold it in the form that is most efficient for the operations you are applying to it. Because Tcl's internal types are all precisely transparent (i.e., conversions to and from them don't lose information) we deliberately don't talk about them much except from an optimisation perspective; as a programmer, you're supposed to pretend that Tcl has no types other than strings of Unicode characters.
You can peel the veil back a bit with the tcl::unsupported::representation command (introduced in 8.6). Don't use the reported types to make decisions in your code, as that is really not something guaranteed by the language, but it does let you see a lot more about what is really going on under the covers. Just remember, the values that you see are not the same as the values that Tcl's implementation thinks about. Thinking in terms of the values you can see (without that magic command) will keep you writing code that is correct.
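For instance, here is a rough illustration (the exact wording of the output varies between builds) of a value's internal representation changing as different operations touch it:
set x [expr {6 * 7}]
puts [tcl::unsupported::representation $x]   ;# typically reports an int, with no string representation yet
append x !
puts [tcl::unsupported::representation $x]   ;# now reports a string representation as well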

Perl: Why do I need to set the latin1 flag explicitly since JSON 2.xx?

Since JSON 2.xx I need to set the latin1 flag in order to get umlauts safely into the HTML document:
my $obj_with_umlauts = {
    title => 'geändert',
};
my $json = JSON->new()->latin1(1)->encode($obj_with_umlauts);
This was not necessary using JSON 1.xx :
my $json = JSON->new()->objToJson($obj_with_umlauts);
The HTML document is in iso-8859-1 (meta tag).
Can anybody explain to me why?
This is such a huge can of worms that you're opening here.
I suspect that the answer is something along the lines of "a bug was fixed in the character handling of JSON.pm". But it's hard to know what is going on without a lot more information about your situation.
How is $string_with_umlauts being set? How are you encoding the data that you write to the HTML document?
Do you want to handle utf8 data correctly (you really should) or are you happy assuming that you live in a Latin1 world?
It's important to realise that if you completely ignore Unicode considerations then it can often seem that your programs are working correctly as errors often cancel each other out. When you start to address Unicode issues, it can seem that your programs are getting worse until you address all of the issues.
The Perl Unicode Tutorial is a good place to start learning about these things.
P.S. It's "Perl", not "PERL".
What are you talking about?
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->objToJson(["\xE4"]);
say sprintf "%v02X", $json;
'
1.15
5B.22.E4.22.5D # Unicode code points for ["ä"]
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # Unicode code points for ["ä"]
Those two strings are identical! In fact, adding ->latin1() doesn't change anything because the iso-8859-1 encoding of Unicode code point U+00E4 is E4.
$ perl -MJSON -E'
say $JSON::VERSION;
my $json = JSON->new()->latin1()->encode(["\xE4"]);
say sprintf "%v02X", $json;
'
2.59
5B.22.E4.22.5D # iso-8859-1 encoding of ["ä"]
There is one difference between the last two: it's stored differently in the scalar. That should make absolutely no difference. If code treats them differently, then that code is incorrectly reading the data in the scalar, and that code is buggy.
$string_with_umlauts definitely is a string in winLatin
Well, that's error number one.
JSON expects strings of decoded text (strings of Unicode code points), not encoded text.
That said, there happens to be no difference between a string encoded using iso-8859-1 and a string of Unicode code points. For example, when encoded using iso-8859-1, "ä" is byte E4, and it's Unicode code point U+00E4, two different notation for the same number.
If the string is encoded using cp1252, though, you'll have problems with characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ (the characters in cp1252 but not in iso-8859-1). For example, when encoded using cp1252, "€" is byte 80, but it's Unicode code point U+20AC. 0x80 != 0x20AC.
The HTML document is in iso-8859-1 (meta tag).
Then at some point you'll have to encode the output into iso-8859-1. You can do that using an :encoding layer, using Encode's encode, or using JSON's ->latin1 directive. The advantage of the final option is that it will cause JSON to escape any character outside of the iso-8859-1 character set before attempting to encode it.
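As a rough sketch of those three options (the file name and data here are placeholders, not taken from the question):
use strict;
use warnings;
use JSON;
use Encode qw(encode);

my $data = { title => "ge\x{E4}ndert" };

# Option 1: ->latin1 makes encode() return iso-8859-1 bytes,
# escaping anything that doesn't fit in that character set.
my $latin1_bytes = JSON->new->latin1->encode($data);

# Option 2: encode the character string to iso-8859-1 yourself.
my $chars = JSON->new->encode($data);
my $bytes = encode('iso-8859-1', $chars);

# Option 3: let an :encoding layer do the conversion on output.
open my $fh, '>:encoding(iso-8859-1)', 'out.json' or die $!;
print {$fh} $chars;
close $fh;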
Can anybody explain to me why?
You have some code (an XS module) that reads the underlying string buffer of the scalar and incorrectly treats that as the content of the string. The bug is in that module.

How to get the real file contents using TFileStream?

I try to get the file contents using TFileStream:
procedure ShowFileCont(myfile: string);
var
  tr: string;
  fs: TFileStream;
begin
  fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
  SetLength(tr, fs.Size);
  fs.Read(tr[1], fs.Size);
  ShowMessage(tr);
  fs.Free;
end;
I made a little text file containing only:
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
and saved this file (using AkelPad) with the 1251 (ANSI) codepage,
then saved it again with the 65001 (UTF-8) codepage.
These two files have different sizes, but their contents are equal: I opened them both in Notepad and they show the same text.
But when I run the ShowFileCont procedure, it shows me different results:
aaaaaaaJ?ЊT?8?V?"?A?aaaaaaa
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
Questions:
How do I get the real file contents using TFileStream?
How can these two files have different sizes when their content (in Notepad) is equal?
Added: Sorry, I didn't say that I use Lazarus FPC, where string = utf8string.
Why do the files have different sizes?
Because they use different encodings. The 1251 encoding maps each character to a single byte. But UTF-8 uses variable numbers of bytes for each character.
How do I get the true file contents?
You need to use a string type that matches the encoding used in the file. So, for example, if the content is UTF-8 encoded (which is the best choice), then you load the content into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded, in which case the code in the question is what you need.
Loading an MBCS encoded file with a code page of 1251, say, is more tricky. You can load that into an AnsiString variable and so long as your system's locale is 1251 then any conversions will be performed correctly.
But the code will behave differently when run on a machine with a different locale. And if you wanted to load text using different MBCS encodings, for example 1252, then you cannot use this approach. You would need to load into a byte array and then convert from 1252, say, to UTF-8 so that you could then store that UTF-8 in a string variable.
In order to do that you can use the LConvEncoding unit from LCL. For example, you can use CP1251ToUTF8, CP1252ToUTF8 etc. to convert from MBCS to UTF-8.
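As a rough sketch (Lazarus/FPC; the function name is made up and the LConvEncoding unit is assumed to be available), loading Windows-1251 bytes and converting them to the UTF-8 strings used here might look like this:
uses
  Classes, SysUtils, LConvEncoding;

function LoadCp1251FileAsUtf8(const FileName: string): string;
var
  fs: TFileStream;
  raw: AnsiString;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    SetLength(raw, fs.Size);
    if fs.Size > 0 then
      fs.ReadBuffer(raw[1], fs.Size);   // raw Windows-1251 bytes
  finally
    fs.Free;
  end;
  Result := CP1251ToUTF8(raw);          // convert to UTF-8 for use as a string
end;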
How can I determine from the file what encoding is used?
You cannot. You can make a guess that will be accurate in many cases. But in general, it is simply impossible to identify the encoding of an array of bytes that is meant to represent text.
It is sometimes possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text. And so you can rule out such files. But for encodings like 1251, 1252 etc. then any byte stream is valid. There's simply no way for you to tell 1251 encoded streams apart from 1252 encoded streams with 100% accuracy.
The LConvEncoding unit has GuessEncoding which sounds like it may be of some use.
Their contents are obviously not equal. You can see for yourself that the file sizes are different. Things of different size are never equal.
Your files might appear equal in Notepad because Notepad knows how to recognize certain character encodings. You saved your file two different ways. One way used an encoding that assigns one byte to each of 256 possible values. The other way used an encoding (UTF-8) that assigns between one and four bytes to each of over a million possible code points. Some of the characters you saved require more than one byte, which explains why one version of the file is bigger than the other.
TFileStream doesn't pay attention to any of that. It just deals with bytes. Depending on your Delphi version, your string variable may or may not pay attention to encodings. Prior to Delphi 2009, string stored one byte per character. As of Delphi 2009, string uses two bytes per character, so your SetLength call is wrong, and everything after that is pointless to investigate much further.
With one byte per character, your ShowMessage call is not going to interpret the string as UTF-8-encoded. Instead, it will interpret your string using whatever your system code page is. If you know that the string you've read is encoded with UTF-8, then you'll want to convert it to UTF-16 prior to display by calling UTF8Decode. That will return a WideString, and you can use any number of functions to display it, such as MessageBoxW. If you have Delphi 2009 or later, then the compiler will insert conversion code for you automatically, if you've used Utf8String instead of string.
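For the one-byte-per-character case, a rough sketch (assuming the Windows unit for MessageBoxW; variable names are invented) of converting and displaying the text might be:
var
  utf8Bytes: AnsiString;   // filled by fs.Read, as in the question
  wide: WideString;
begin
  wide := UTF8Decode(utf8Bytes);                       // UTF-8 -> UTF-16
  MessageBoxW(0, PWideChar(wide), 'File contents', MB_OK);
end;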

Data Type of Module Output

I have a script that I run on various texts to convert XHTML entities (e.g., &uuml;) to ASCII. For example, my script is written in the following manner:
open (INPUT, '+<file') || die "File doesn't exist! $!";
open (OUTPUT, '>file') || die "Can't find file! $!";
while (<INPUT>) {
    s/&uuml/ü/g;
    print OUTPUT $_;
}
This works as expected and substitutes the XHTML entity with the ASCII equivalent. However, since this is run often, I've attempted to convert it into a module. But Perl doesn't return "ü"; it returns the decomposition. How can I get Perl to return the data with the ASCII equivalent (as run and printed by my regular .pl file)?
There is no ASCII. Not in practice anyway, and certainly not outside the US. I suggest you specify an encoding that will have all characters you might encounter (ASCII does not contain ü, it is only a 7-bit encoding!). Latin-1 is possible, but still suboptimal, so you should use Unicode, preferably UTF-8.
If you don't want to output in Unicode, at least your Perl script itself should be encoded as UTF-8. To signal this to the perl interpreter, put use utf8; at the top of your script.
Then open the input file with an encoding layer like this:
open my $fh, "<:encoding(UTF-8)", $filename
The same goes for the output file. Just make sure to specify an encoding when you want to use one.
You can change the encoding of a file with binmode, just see the documentation.
You can also use the Encode module to translate a byte string to unicode and vice versa. See this excellent question for further information about using Unicode with Perl.
If you want to, you can use the existing HTML::Entities module to handle the entity decoding and just focus on the I/O.
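A minimal sketch of that approach (the file names are placeholders), combining UTF-8 I/O layers with HTML::Entities for the decoding itself:
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

open my $in,  '<:encoding(UTF-8)', 'input.xhtml' or die "Can't read: $!";
open my $out, '>:encoding(UTF-8)', 'output.txt'  or die "Can't write: $!";

while (my $line = <$in>) {
    print {$out} decode_entities($line);   # &uuml; becomes ü, and so on
}
close $in;
close $out;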

Migrating MS Access data to MySQL: character encoding issues

We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except for one thing: some of the characters are represented as question marks. For example, "He wasn't ready" shows up as "He wasn?t ready". This happens only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them to the double-byte left-quote and right-quote characters, and translates a single quote into the double-byte apostrophe character. The double-byte characters in question belong to an MS code page which is roughly compatible with UTF-16 except for the silly quote characters.
There is a Perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.
It's most likely due to the fact that the data in the Access file is UTF-encoded, and MDB Tools is trying to convert it to ASCII/Latin/ISO-8859-1 or some other encoding. Since those encodings don't map all the UTF characters properly, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.