How to port an app from Borland Pascal to FreePascal on a Unicode terminal

I am trying to port the first app I ever wrote from old Borland Pascal to FreePascal and run it in a Linux Unicode shell.
Unfortunately, the app uses the CRT unit and writes non-standard ASCII graphical characters. So I tried to rewrite statements like these:
gotoxy(2,3); write(#204);
writeln('3. Intro');
to these:
gotoxy(2,3); write('╠');
write('3. Intro', #10);
Two notes:
I use Unicode characters directly in the code because I could not find out how to write Unicode characters by their code (but see the sketch right after these notes).
I used the write procedure instead of writeln to make sure that Unix line endings are produced.
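One way to write such a character by its code point, at least in principle: take the UTF-8 bytes of the code point (Python is used here only to compute them) and chain their decimal values in a Pascal #-escape. A rough, untested sketch:
>>> '\u2560'.encode('utf-8')          # U+2560 is '╠'
b'\xe2\x95\xa0'
>>> list('\u2560'.encode('utf-8'))    # decimal values for a #-escape
[226, 149, 160]
So write(#226#149#160) should, assuming both the source file and the terminal treat strings as plain bytes, produce the same output as write('╠').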
But after replacing all the non-standard ASCII characters and getting rid of all the writeln statements, it became even worse.
Before changes: (screenshot in the original post)
After changes: (screenshot in the original post)
Why does it end up like this? What can I do better?
Update: after some time, here is what I found out.
1) I cannot port it
As user dmsc rightly pointed out, CRT does not support UTF-8. The hack he suggested did not work for me.
2) When you can't port it, emulate environment.
The graphical characters I needed were part of CP437. There is a program called luit that is made for converting application output from the locale's encoding into UTF-8. Unfortunately, this did not work for me; it simply erased the characters:
# Via iconv, everything is OK:
$ printf "top right corner in CP437: \xbf \n" | iconv -f CP437 -t UTF-8
top right corner in CP437: ┐
# But not via luit, that simply omit the character:
$ luit -gr g2 -g2 'CP 437' printf "top right corner in CP437: \xbf \n"
top right corner in CP437:
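As a cross-check (done in Python, purely for illustration), the byte really does map to the expected character in CP437, so it is luit, not the data, that is at fault:
>>> b'\xbf'.decode('cp437')    # CP437 0xBF is the top-right corner
'┐'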
So my solution is to run gnome-terminal, add and select the Hebrew (IBM862) legacy encoding (tutorial here), and enjoy the app! IBM862 keeps CP437's box-drawing characters at the same positions, which is why this works.

The CRT unit does not currently work with UTF-8, as it assumes that each character on the screen is exactly one byte; see http://www.freepascal.org/docs-html-3.0.0/rtl/crt/index.html
But simple applications can be made to work by "tricking" GotoXY into always doing a full cursor positioning:
GotoXY(1,1);
GotoXY(x, y);
To replace all the strings in your source file you can use recode; in a terminal, type:
recode cp437..u8 < original.pas > fixed.pas
Then you need to replace all the numeric character escapes (like your #204 example) with the equivalent UTF-8; you can use:
echo -e '\xCC' | recode cp437/..u8
The 'CC' is hexadecimal for 204, and as a result the character '╠' will be printed.
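If there are many such escapes, doing them one at a time gets tedious. Here is a small, hypothetical Python helper (my own sketch, not part of the original answer; it assumes every code above 127 is CP437 and does not handle hex escapes like #$CC or escapes inside comments):
import re, sys

def fix_escapes(text):
    # Rewrite decimal escapes like #204 as UTF-8 string literals,
    # treating codes >= 128 as CP437.
    def repl(match):
        code = int(match.group(1))
        if code < 128:            # leave plain ASCII escapes alone
            return match.group(0)
        return "'" + bytes([code]).decode('cp437') + "'"
    return re.sub(r'#(\d+)', repl, text)

# Read the source as CP437 (so raw literals convert too) and write
# UTF-8 to stdout: python3 fix_escapes.py original.pas > fixed.pas
with open(sys.argv[1], encoding='cp437') as src:
    sys.stdout.write(fix_escapes(src.read()))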

Related

Convert UTF-8 to ANSI in tcl

proc pub:write { nick host handle channel arg } {
    set fid [open /var/www/test.txt w]
    puts $fid "█████████████████████████████████████████████████████████████████"
    puts $fid "██"
    close $fid
}
When I open it in a web browser, the result is:
█████████████████████████████████████████████████████████████████
but it should be:
█████████████████████████████████████████████████████████████████
Welcome to the yawning pit of complexity that is string encodings. You've got to get two things right to make what you're trying to do work. READ EVERYTHING BELOW BEFORE MAKING CHANGES as it all interacts horribly.
The character needs to be written to the file using the right encoding. This is done by configuring the encoding on the channel, which defaults to a system-specific value that is usually but not always right.
I'm taking a very wild guess that an encoding like “cp437 DOSLatinUS” is the right one.
fconfigure $fid -encoding cp437
However, Tcl's usually pretty good at picking the right thing to do by default.
Also, there's a huge number of different encodings. Some are very similar to each other and picking which one to use is a bit of a black art. The usual best bet is to stick with utf8 when possible, and otherwise to use the correct encoding (defined by protocol or by the system) and take a vast amount of care. This is really complicated!
You've also got to get the character into Tcl correctly in the first place. This means that the character has to be encoded in the source file, and Tcl has to read that file with the right encoding. Since the file is being written by another program (your editor usually) there's all sorts of potential for trouble. If you can discover what encoding is being used there (usually a matter of complete guesswork) then you can use the -encoding option to tclsh or source to allow Tcl to figure out what is going on.
Alternatively, stick with the ASCII subset in your source as that's pretty reliably handled the same whatever encoding is in use. You do this by converting each █ to the Tcl escape sequence \u2588. At least like that, you can be sure that you're only hunting down problems with the output encoding.
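As a quick sanity check of that escape (done in Python here, simply because it is handy for poking at encodings), U+2588 really does have a single-byte representation in cp437:
>>> '\u2588'.encode('cp437')    # FULL BLOCK maps to one CP437 byte
b'\xdb'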
When debugging this thing, only change one thing at a time before retesting as there's a lot of bits that can go wrong and poison what is going on in ways that produce weird results downstream. I advise trying the escape sequence first as that at least means that you know that the input data is correct; once you know that you're not pushing garbage in, you can try hunting down whether you're actually getting problems with getting garbage out and what to do about it.
Finally, be aware that mixing networking into this makes the problems about ten times harder…

MySQL on remote machine accessed via chromebook terminal returns nonsense unicode which persists after I leave MySQL

I am using the terminal in a chromebook to ssh into a remote server. When I run a MySQL (5.6) select query, sometimes one of the fields will return nonsense unicode (when the field should return an email address) and change the MySQL prompt from:
mysql>
to
└≤⎽─┌>
and whatever text I type is converted into weird Unicode. The problem persists even after I exit MySQL.
One of the values in your database happened to have the sequence of bytes 0x1B, 0x28, 0x30 (ESC ( 0) in it. When you did the query, MySQL printed this byte sequence directly to your console. You can reproduce the effect from Python by typing:
>>> print('\x1B\x28\x30')
Consoles use control characters (in particular 0x1B, ESC) as a way to allow applications to control aspects of the console other than pure text, such as colours and cursor movements. This behaviour is inherited from the old dumb-terminal devices that they are pretending to be (which is why they are also known as terminal emulators), along with some weirder tricks that we probably don't need any more. One of those is to switch permanently between different character sets (considered encodings, now, but this long predates Unicode).
One of those alternative character sets is the DEC Special Graphics Character Set which it looks like you have here. In this character set the byte 0x6D, usually used in ASCII for m, comes out as the graphical character └.
You could in principle reset your terminal to normal ASCII by printing the byte sequence 0x1B, 0x28, 0x42 (ESC ( B), but this tends to be a pain to arrange when your console is displaying rubbish.
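For example, from Python again (assuming the terminal is stuck in the graphics state):
>>> print('\x1B\x28\x42')    # ESC ( B: switch G0 back to ASCII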
There are potentially other ways your console can become confused; it's not, in general, safe to print arbitrary binary data to the console. There even used to be nastier things you could do with the console by faking keyboard input, which made this a security problem, but today it's just an annoyance factor.
However, one wouldn't normally expect to have any control codes in an e-mail address field. I suggest the application using the database should be doing some validation on the input it receives, and dropping or blocking all control codes (other than potentially newlines where necessary).
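For instance, a minimal sanitiser along those lines (a Python sketch, not tied to any particular framework) could strip control bytes before a value is ever stored:
import re

# Drop ASCII control characters except newline (0x0A); this removes
# ESC (0x1B) and friends so they can never reach the terminal later.
CONTROL_CHARS = re.compile(r'[\x00-\x09\x0b-\x1f\x7f]')

def sanitize(value):
    return CONTROL_CHARS.sub('', value)

print(sanitize('user@example.com\x1b\x28\x30'))   # -> user@example.com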
As a quick hack to clean this field for the specific case of the ESC character, you could do something like:
UPDATE things SET email=REPLACE(email, CHAR(0x1B), '');

what type of encoding is this "Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators"?

I am trying to export the output of an MSSQL query, which must use UTF (UTF-16, I suppose) encoding according to the description; I am using the -W and -u options with sqlcmd.
Ä is converted by default to a z with two dots, or something like an inverted ^ over it (presumably Ž), and the editor lists the file as the ANSI character set.
When I try to use Notepad++ to convert this file to UTF-8, it shows some strange highlighted characters (x8E) for Ä, and some others like x86 and x94 for other characters, no matter what encoding I set as the default in Notepad++.
When I transferred the file to an Ubuntu 12.04 machine, the file command says that it is:
user@user:~/Desktop/encoding/checkencoding$ file convertit4.csv
convertit4.csv: Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
user@user:~/Desktop/encoding/checkencoding$ chardet convertit4.csv
convertit4.csv: ISO-8859-2 (confidence: 0.77)
I am confused about what kind of encoding it uses.
The purpose is to convert it to UTF-8 encoding without any errors, in order to upload it to the Magmi importer.
Note: I am using this command to remove the underline after the headers; I hope it is not the problem:
type c:\outfiles\convertit1.temp | findstr /r /v "^\-[;\-]*$" > c:\outfiles\convertit4.csv
I hope the information is complete enough to solve this issue. If any more information is needed, please let me know.
Regards.
Try the -f option (it sets the code page sqlcmd uses for input and output files), as in
http://www.yaldex.com/sql_server_tutorial_3/ch06lev1sec1.html
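Alternatively (a suggestion beyond the answer above): the x8E, x86 and x94 bytes sit exactly where Ä, å and ö live in the OEM code pages 437 and 850, so a direct re-encode under that assumption may be enough, for example in Python:
# Sketch under an assumption: the file is CP850, the usual Windows OEM
# code page (try cp437 instead if umlauts still come out wrong).
# newline='' keeps the CRLF line endings untouched.
with open('convertit4.csv', encoding='cp850', newline='') as src:
    data = src.read()
with open('convertit4-utf8.csv', 'w', encoding='utf-8', newline='') as dst:
    dst.write(data)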

Migrating MS Access data to MySQL: character encoding issues

We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except for one thing: some of the characters are represented as question marks. This: "He wasn’t ready" shows up like this: "He wasn?t ready", but only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them to the double-byte left-quote and right-quote characters, and translates a single quote into the double-byte apostrophe character. The double-byte characters in question belong to an MS code page which is roughly compatible with UTF-16, except for the silly quote characters.
There is a Perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.
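If Perl isn't handy, the same clean-up is easy to sketch in a few lines of Python (the mapping below is a hand-rolled approximation covering only the usual Word punctuation):
# Map the common Word "smart" punctuation back to plain ASCII.
SMART = {
    '\u2018': "'", '\u2019': "'",     # single curly quotes
    '\u201c': '"', '\u201d': '"',     # double curly quotes
    '\u2013': '-', '\u2014': '--',    # en / em dash
    '\u2026': '...',                  # ellipsis
}

def demoronise(text):
    return text.translate(str.maketrans(SMART))

print(demoronise('He wasn\u2019t ready'))   # -> He wasn't ready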
It's most likely due to the fact that the data in the Access file is UTF, and MDB Tools is trying to convert it to ASCII/Latin/ISO-8859-1 or some other encoding. Since these encodings don't map all the UTF characters properly, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.

iconv gives "Illegal Character" with smart quotes -- how to get rid of them?

I have a MySQL table with 120,000 rows stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a URL-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strtr() call to replace any accented character by its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('’', "'", $value);
(’ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace's cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?
Glibc (and the GNU libiconv) supports //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure which iconv implementation PHP uses, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
What do you mean by "link-friendly"? Only way that makes sense to me, since the text between <a>...</a> tags can be anything, is actually "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii function that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, never mind.)
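For what it's worth, the core trick such transliteration libraries rely on can be sketched in Python (illustrative only; it handles characters that decompose into a base letter plus an accent, and silently drops the rest, much like //IGNORE):
>>> import unicodedata
>>> s = 'Crème brûlée’s'
>>> unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
'Creme brulees'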
Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...